Preface



Domain/OS Design Principles describes the architecture and design 
of Domain/OS, Apollo's workstation operating system. 

We've organized this manual as follows: 

Part 1    Introduces the origins of Domain/OS and discusses the
          fundamental design principles we used to develop the
          system.

Part 2    Contains technical papers by Apollo engineers that
          describe in greater detail some of these design
          principles.

References for the material in each chapter are at the end of that 
chapter. 
Documentation Conventions

Unless otherwise noted in the text, this manual uses the following
symbolic conventions.

literal values           Bold words or characters in formats and
                         command descriptions represent commands or
                         keywords that you must use literally.
                         Pathnames are also in bold. Bold words in
                         text indicate the first use of a new term.

user-supplied values     Italic words or characters in formats and
                         command descriptions represent values that
                         you must supply.

sample user input        In examples, information that the user
                         enters appears in bold.

output                   Information that the system displays
                         appears in this typeface.

[ ]                      Square brackets enclose optional items in
                         formats and command descriptions. In sample
                         Pascal statements, square brackets assume
                         their Pascal meanings.

{ }                      Braces enclose a list from which you must
                         choose an item in formats and command
                         descriptions. In sample Pascal statements,
                         braces assume their Pascal meanings.

|                        A vertical bar separates items in a list of
                         choices.

< >                      Angle brackets enclose the name of a key on
                         the keyboard.

CTRL/                    The notation CTRL/ followed by the name of
                         a key indicates a control character
                         sequence. Hold down <CTRL> while you press
                         the key.

...                      Horizontal ellipsis points indicate that
                         you can repeat the preceding item one or
                         more times.

(vertical ellipsis)      Vertical ellipsis points mean that
                         irrelevant parts of a figure or example
                         have been omitted.

(end-of-chapter symbol)  This symbol indicates the end of a chapter.
Chapter 1

The Origins of Domain/OS 



Domain/OS, which first shipped to customers in July 1988, is
Apollo's workstation operating system software. It comprises a
common kernel and three environments: BSD, SysV, and Aegis™.
BSD and SysV provide the two major UNIX* operating environments,
and Aegis supplies Apollo's original operating environment.

This chapter explores the context from which Domain/OS arose. It 
includes a brief overview of operating system architecture in general 
and of Domain/OS in particular, a discussion of some of the special 
operating system requirements of the workstation market, and some 
remarks about the role that the UNIX system played in the devel- 
opment of Domain/OS. 



* UNIX is a registered trademark of AT&T in the USA and other 
countries. 
Architecture of a Workstation Operating System

The purpose of any computer's operating system is twofold. First, it 
implements an abstract machine that is far more convenient to use 
than raw hardware. Second, it allocates, controls access to, and 
otherwise manages such physical computing resources as proces- 
sors, memory, and peripherals. 

For its users, however, an operating system should allow the most 
effective and efficient use possible of all the resources in the com- 
puting environment and give each of many users the illusion of ex- 
clusive use of the machine. For workstation users, these resources 
include CPU power, memory, network bandwidth, and disk space. 

However, the most important resource is clearly the users' time. 
For software developers, in particular, the operating system has to
minimize development time, provide tools and mechanisms to
facilitate innovation, and realize, as much as possible, the maximum
power of the hardware. It must also leverage developers' efforts by
ensuring portability for their software.

New technology has changed the character of operating systems, as
well. In recent years, we have seen the advent of inexpensive,
high-speed local area networks, high-resolution bitmap displays,
multi-MIP CPUs, multiprocessor hardware, and less expensive
semiconductor memory and high-capacity disks. All of these have put
new demands on traditional operating systems.
Domain/OS Overview

Apollo's Domain/OS is a high-performance UNIX workstation op- 
erating system that is built on object-oriented principles. It effec- 
tively exploits the new hardware technology of the last few years, 
both in exclusively Apollo environments and in heterogeneous envi- 
ronments. The issue of heterogeneity has become increasingly im- 
portant, as fewer and fewer customers are willing, or even able, to 
restrict their sites to using computing equipment from a single 
manufacturer. 

Domain/OS consists of a common kernel with three operating envi- 
ronments. The Aegis environment provides all the functionality of 
the Aegis operating system, Apollo's original operating environ- 
ment, and the BSD and SysV environments provide users with en- 
hanced Berkeley Software Distribution 4.3 and AT&T System V 
Release 3 UNIX environments, respectively.* 

Each of these environments can run without relying on the others. 
The environments can also run concurrently, so that any Domain/ 
OS site can use two or three environments and enjoy a great deal of 
flexibility. By providing separate implementations of the two major 
UNIX development threads, rather than one "amalgamated" UNIX 
system, Domain/OS can track both standards as they evolve. The 
availability of the two UNIX environments also fulfills our custom- 
ers' needs for software portability. 

Domain/OS uses Apollo's Open System Toolkit™ (OST) to enable 
customers to extend the power of the operating system and to sup- 
port true distributed computing in multi-vendor environments with 
the Network Computing System. 

The Open System Toolkit provides tools that allow operating systems
programmers to create new types of I/O targets (that is, devices)
without modifying the operating system's source code. The Open
System Toolkit also includes facilities to add new object types to
the system.



* SysV is compatible with the System V Release 3 Interface Defini- 
tion (SVID) for Base OS, Base Libraries, and Library Extensions. 
In traditional UNIX systems, when device drivers are written for
new devices, a systems programmer must modify and rebuild the 
operating system source code to install the new driver. The OST 
approach makes it far simpler and cleaner to add device drivers to 
the operating system. Other concepts associated with the OST (ob- 
ject types, type managers, extended naming) make it significantly 
easier to customize the Domain/OS system to answer a particular 
site's needs. Chapter 2 discusses these concepts. 

One of Domain/OS's major strengths is the distributed file system,
which provides transparent access to data anywhere in the network.
Beyond distributed data, however, Domain/OS offers true distributed
computing in the form of the Network Computing Architecture. In
addition to providing ways to make optimal use of computing
resources, the Network Computing Architecture provides ways to take
advantage of parallel processing and specialized hardware.

The Network Computing System™ is a portable implementation of 
the Network Computing Architecture that runs on both UNIX sys- 
tems and other systems. In addition to being object-oriented, the 
Network Computing System supplies a transport-independent re- 
mote procedure call facility. The system is built on a concurrent 
programming support package that allows multiple execution 
threads in a single address space and also contains a replicated 
global location database for objects. 
The Role of the UNIX Operating System

Traditionally, each computer manufacturer designed its own oper- 
ating system to meet the needs of both its own customers and its 
internal software developers. Recently, however, it has become 
clear that one of the best ways to maximize the productivity of soft- 
ware developers and to provide the customer some measure of ven- 
dor independence is to make the operating system's services avail- 
able through interfaces that are standard across many manufactur- 
ers' equipment. In the workstation world (as in the minicomputer, 
and, increasingly, the mainframe worlds), the standard is defined 
by the UNIX operating system. 

More than a standard set of interfaces, however, the UNIX system 
is a framework into which new interfaces will be incorporated, as 
well as a context and a common language for discussing new operat- 
ing system ideas. The style already established by existing UNIX 
features will, to a large degree, determine the shape and feel of 
future operating system functionality. 

Domain/OS provides, in addition to the UNIX interfaces, compat- 
ible extensions. One example of these extensions is the Domain/OS 
protection system, which provides all the functionality of the tradi- 
tional UNIX modes, but extends them by providing more flexible 
and more finely grained protection levels. These extensions are en- 
tirely optional and do not interfere if users wish to run a "pure" 
UNIX environment on Apollo hardware. 



UNIX Compatibility — Implications 

It is difficult to overestimate the effects of having the UNIX operat- 
ing system as a standard. Many types of application software are 
being developed exclusively, or at the very least, first, for the UNIX 
system because of the tremendous leverage the system provides in 
making a developer's software available to a wide audience. Many 
new standards (in windowing systems, for example) are being devel- 
oped for UNIX systems, and are therefore coming into widespread 
use much more rapidly than they would otherwise. 

But the notion of UNIX compatibility extends beyond the program- 
ming interfaces to the areas of performance and user environment. 
Developers porting UNIX code expect the relative performance of 
various system facilities to be more or less the same from one UNIX
system to another. 

Furthermore, UNIX systems provide a user environment that is 
fairly consistent from machine to machine. As a result, program- 
mers, system administrators, and other end users all find that many 
of their skills and work habits carry over from one UNIX system to 
another. This allows organizations that run or develop software to 
make productive use of newly hired employees more quickly. 

So the first requirement for a competitive operating system in the 
workstation market is UNIX compatibility in all aspects. In its 
UNIX environments, Domain/OS provides both the interfaces and 
the feel of a UNIX system. In addition, Apollo continues to use the 
UNIX software as a base for implementing compatible extensions, 
and proposing those extensions within such standards bodies as the 
IEEE POSIX group, the Open Software Foundation (OSF), and in 
the UNIX community at large. 



The UNIX Philosophy 

The original developers of the UNIX system were programmers who 
had pragmatic ideas about how an operating system should be struc- 
tured and used. These ideas have seeped into the software develop- 
ment and engineering culture, and are known collectively and in- 
formally as the UNIX philosophy. The major points of this philoso- 
phy are generally agreed to be: 

• Provide just enough mechanism to get the job done, and 
no more. 

• Let each command do one thing well. 

• Don't try to do the user's job. Provide tools as building 
blocks, which can be arranged as necessary to perform a 
task. 

• The output of one command should be usable as the input 
of another command. 

• Files are unstructured byte streams. Applications may im- 
pose any internal structure, but to the system, a file is a 
file. 






Some aspects of this philosophy are no longer as relevant as they 
once were. As UNIX systems have become platforms for applica- 
tions more than just a means to support programming and word 
processing, we have seen the emergence of applications programs 
that approach (or even exceed) the operating system itself in size 
and complexity. Areas such as computer-aided drafting (CAD) and 
desktop publishing create self-contained, highly interactive, graph- 
ics-intensive environments in which the end user is largely uncon- 
cerned with the nature of the underlying operating system. 

However, one aspect of the UNIX philosophy, often implied but 
seldom stated explicitly, is still extremely relevant. It has to do with 
finding the proper relationship between generality and perform- 
ance. There are two parts to this: 

• Find the right balance between the current and future 
needs of the customer. 

• Provide what you can implement efficiently today. Don't 
offer functionality that demands more performance than 
the current technology can offer, because the functionality 
won't get used. 

However this point is stated, it is clear that trying to achieve these 
goals is the essence of skillful engineering. It is also clear that the 
technology available today makes it practical (and even necessary) 
to offer features that might have been too costly just a few years 
ago. 

While important, UNIX compatibility is by no means sufficient to 
solve all of today's operating system problems, and the desire to 
maintain compatibility with the UNIX system should not obviate the 
possibility of improving and extending it. 



Beyond UNIX Compatibility 

Current UNIX implementations fail in several ways to provide 
maximum value to users. This is not surprising. Although the UNIX 
kernel has evolved considerably since its original implementation in 
the early 1970s, its basic characteristics have not changed in any 
fundamental way. 

At the time of its inception, most of the machines available were 
minicomputers with low-powered (by today's standards) CPUs, 
limited physical memory, small disks, and little or no networking
capability. The expense of the hardware dictated that these ma- 
chines be timeshared. 

As time has passed, much new functionality has been added to the 
UNIX system, but most of it has been added to the kernel, making 
it quite bloated. As a result, the kernel has lost much of its original 
elegance and simplicity, and is no longer the best match for current 
technology. 

Other developers have begun to recognize the limitations of the 
traditional UNIX kernel [1], and are exploring new ways of struc- 
turing the operating system. They hope to demonstrate that an op- 
erating system can realize the benefits of the UNIX system without 
being subject to the constraints of a monolithic kernel. 

It is possible to enumerate a number of principles and general prop- 
erties that such a restructured UNIX system should possess, and 
Domain/OS already embraces many of these principles. In the next 
chapter, we discuss in more detail the design principles underlying 
the Domain/OS operating system. 
References

[1] R. Rashid. Threads of a New System. UNIX Review, 4, No. 8,
    pp. 37-49, August 1986.
Chapter 2

Domain/OS Design Principles 



Domain/OS is a complex operating system. It employs several inter- 
esting design principles to manage that complexity and to avoid 
some of the pitfalls of traditional, monolithic UNIX kernels. 
Among these principles are: 

• Object orientation 

• A small kernel 

• System functionality in user space 

• Dynamic loading, linking, and sharing of system libraries 

The goals of Domain/OS's design include: 

• Support for very large virtual address space processes 

• More efficient use of physical memory 

• Single-level store and transparent object location 

• Network-wide access to file system objects through virtual 
memory management 

• Scaling to many machines and/or users 
• Extensibility by users

• Support for multiple operating system environments and 
multiple processors 

The sections that follow expand on these fundamental Domain/OS 
design principles and goals. 



Object Orientation 

Briefly, under the object-oriented model, the world consists of a 
collection of opaque abstract objects. The behavior of these ob- 
jects, but not their internal implementation, can be discerned by 
other objects. The state of an object can be altered or observed 
only through the interfaces (operations) that the object exports to 
the rest of the system. 

Because the behavior of one object does not depend on the imple- 
mentation of another object, any object's internal implementation 
is free to change, as long as it still presents the same behavior to the 
rest of the world. 
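
As a rough illustration of this model (not Domain/OS code; the class
and function names are hypothetical), the Python sketch below shows
two objects that export the same operations while keeping different
internal representations. A client that uses only the exported
operations cannot tell them apart, so either implementation is free
to change.

```python
class ListCounter:
    """Keeps its state as a list of recorded events."""

    def __init__(self):
        self._events = []      # internal representation, not part of the interface

    def increment(self):
        self._events.append(1)

    def value(self):
        return len(self._events)


class IntCounter:
    """Same exported operations, different internal representation."""

    def __init__(self):
        self._count = 0

    def increment(self):
        self._count += 1

    def value(self):
        return self._count


def count_three(counter):
    """Client code: depends only on increment() and value()."""
    for _ in range(3):
        counter.increment()
    return counter.value()
```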



Object-orientation in Domain/OS is largely by convention, since 
the language used to implement the kernel does not enforce infor- 
mation-hiding and data abstraction. This allows Domain/OS to vio- 
late the rules of the object-oriented model, when necessary, to 
avoid the expense of excess layering that is inherent in a pure ob- 
ject model. 
System Functionality in User Space



A general goal of the Domain/OS design was to minimize the size of 
the kernel by implementing system functionality in user space, 
where doing so would not compromise security or performance. 
This has several benefits: 



• It makes it easier to prototype, implement, and debug 
these facilities, because all of the interfaces and tools avail- 
able to any application program are accessible in user 
space. 

• In some cases, better performance results because traps to 
the kernel can be avoided. 

• It makes it possible to avoid paying the cost of functions 
that aren't necessary. User space libraries and servers that 
implement optional functionality need not be installed. 

• It allows users to substitute for or extend system function- 
ality without modifying the kernel. 

Domain/OS makes user-space implementation of system function- 
ality practical by providing some of the same mechanisms in user 
space that traditional kernels (including the UNIX kernel) depend 
on internally. 



The most important of these are a shared memory facility for 
cheap, high-bandwidth communication among processes and an in- 
expensive mutual exclusion mechanism, based on eventcounts. Ad- 
ditionally, a user-space cleanup mechanism is needed to make sure 
that user-space global state (see below) is properly cleaned up 
when a process dies, as well as an inexpensive way to defer asyn- 
chronous signals while critical sections of user-space code are being 
executed [5]. 

Most UNIX implementations do not have these facilities available 
outside the kernel. It is critical that such mechanisms be cheap, so 
that moving functionality out of the kernel does not incur a
performance penalty. For example, the mutual exclusion call for
entering a critical section executes only six machine instructions (and no
system calls) in the case where there is no contention for the mutual 
exclusion lock. 
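
The text does not spell out how mutual exclusion is built on
eventcounts. The sketch below shows one classic construction, an
eventcount paired with a ticket sequencer, under the assumption that
something similar applies here. All class and method names are
hypothetical, and because the sketch relies on Python's own locks
internally, it illustrates the protocol only, not the six-instruction
uncontended path.

```python
import threading


class EventCount:
    """A counter that only increases and that threads can wait on."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def advance(self):
        with self._cond:
            self._value += 1
            self._cond.notify_all()

    def await_value(self, target):
        with self._cond:
            while self._value < target:
                self._cond.wait()


class Sequencer:
    """Hands out strictly increasing ticket numbers, starting at 0."""

    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def ticket(self):
        with self._lock:
            t = self._next
            self._next += 1
            return t


class EventCountMutex:
    def __init__(self):
        self._turn = EventCount()
        self._tickets = Sequencer()

    def acquire(self):
        my_turn = self._tickets.ticket()
        self._turn.await_value(my_turn)   # returns at once if uncontended

    def release(self):
        self._turn.advance()              # admit the next ticket holder


def run_demo(n_threads=4, iterations=200):
    """Several threads increment a shared total under the mutex."""
    mutex = EventCountMutex()
    total = [0]

    def worker():
        for _ in range(iterations):
            mutex.acquire()
            total[0] += 1                 # critical section
            mutex.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Each thread enters in ticket order, so the lock is also fair: no
waiter can be overtaken by a later arrival.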
A typical example of the use of these facilities in Domain/OS is
Apollo's TCP/IP implementation, which is implemented entirely in
user space (using kernel network device drivers), and achieves
highly competitive performance.

In addition to these facilities, Domain/OS includes the GPIO sub- 
system, which allows device drivers to be written, debugged, and 
installed entirely in user space. This makes the work of adding a 
new device driver to the system much simpler than in traditional 
architectures where a new device driver must be bound into the 
kernel. 



Perhaps what is most important about these design principles is not 
that they have been used internally to structure the operating sys- 
tem, but that they also form the basis for allowing users to extend 
the functionality of the system in ways that Apollo may not have 
anticipated. 

In particular, the object-oriented approach makes it possible for 
users to customize or extend the system without knowledge of how 
other portions of the system are implemented. This user exten- 
sibility is discussed in a later section. 
Dynamic Loading, Linking, and Sharing of System Libraries

UNIX applications can use services that are provided by the kernel 
as well as services provided by libraries. The kernel services are 
obtained by trapping into the kernel from user mode. The library
services are obtained by binding the necessary routines into the
executable image of the program. Thus, binding to kernel services is
effectively done at system boot time, while binding to library
services is done at program bind time. This scheme has three obvious
disadvantages:

• It wastes disk space, since copies of the library routines are 
duplicated in every command. 

• It increases the total working set of processes, since each 
process is accessing its own copy of the routines. 

• It creates a serious maintenance problem for third-party 
software vendors: if a manufacturer fixes bugs or enhances 
performance in the library routines, the software vendor's 
customers will not see those changes until the vendor re- 
builds and redistributes its applications to its customers. 

These problems are solved in Domain/OS by the use of shared li- 
braries. There are two kinds of shared libraries: global and private. 
The implementation of shared libraries relies on three basic mecha- 
nisms present in Domain/OS: position-independent code, the 
known global table, and global user address space. 



Position-Independent Code 

The Domain/OS compilers can generate position-independent code
(PIC). This has the effect of inserting an extra level of indirection
for each external procedure and each reference to global data. For
a procedure, this extra code is called a transfer vector, and is
located in the read/write data section of a module. The transfer
vector simply jumps to the actual procedure entry point.

This allows the procedure text of a shared library to be loaded any- 
where in an address space without relocation. The procedure text 
can therefore be read-only, and thus be shared by all the processes
that use the library. All load-time relocation takes place in the data 
area, which already must be writeable anyway. 
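
The indirection can be pictured with a small Python model. This is a
sketch only: the Module class and relocate function are hypothetical
stand-ins for a loaded module and the loader, and a Python dictionary
stands in for the read/write data section. The point it shows is that
a call site always goes through a writable slot, so only the slot is
patched at load time and the code itself never changes.

```python
class Module:
    """A loaded module: unchanging 'text' plus a writable transfer-vector table."""

    def __init__(self, external_names):
        # One writable slot per external procedure, unresolved until load time.
        self.transfer_vectors = {name: None for name in external_names}

    def call(self, name, *args):
        """What a call site in the text does: jump through the slot."""
        target = self.transfer_vectors[name]
        if target is None:
            raise RuntimeError("unresolved external: " + name)
        return target(*args)


def relocate(module, name, entry_point):
    """Load-time relocation touches only the writable data area."""
    module.transfer_vectors[name] = entry_point
```

For example, after `relocate(m, "length", len)` a call such as
`m.call("length", "abc")` reaches the built-in `len` through the
slot.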



Known Global Table 

The Known Global Table keeps track of all the symbols exported 
by the libraries currently installed. It is consulted whenever a pro- 
gram that has unresolved external references is loaded. Any such 
references that are found are filled in with the symbol's value from 
the Known Global Table. 



Virtual Address Space Layout 

Domain/OS supports the concept of global user address space. Ac- 
tually, it divides the address space into four partitions: global super- 
visor, global user, private supervisor, and private user. Global ad- 
dress space is shared among all processes, with global supervisor 
space being the portion where the kernel text and data are mapped. 
It is readable and writeable only when the system is executing ker- 
nel code as a result of a system call, an interrupt, or a context 
switch to a kernel-only process. Global user space, on the other 
hand, is accessible to any user space code, just like the more tradi- 
tional private user portion of the address space. 



Global Libraries 

Global shared libraries are installed into global user space once, at 
boot time. Each node has a system configuration file that specifies 
which libraries are to be treated as global and which as private. All 
libraries specified in this file as either global or private (the exact
terms are global and shared) have their entry points stored in the
global Known Global Table.



When a program that makes calls to global library entry points is 
loaded, that program's transfer vectors are filled in with the values 
found in the Known Global Table, and the program is then ready to 
run. 
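
The load step just described can be sketched as follows. This is an
illustrative model, not the Domain/OS loader: the Known Global Table
is represented as a plain dictionary, and the function names, the
exception, and the symbol names used with them are hypothetical.

```python
class UnresolvedSymbol(Exception):
    """Raised when a program refers to a symbol no installed library exports."""


def install_library(known_global_table, exports):
    """Record a library's exported entry points in the table."""
    known_global_table.update(exports)


def load_program(transfer_vectors, known_global_table):
    """Fill every unresolved slot from the table, or report what is missing."""
    missing = []
    for name, target in transfer_vectors.items():
        if target is not None:
            continue                      # already resolved
        if name in known_global_table:
            transfer_vectors[name] = known_global_table[name]
        else:
            missing.append(name)
    if missing:
        raise UnresolvedSymbol(", ".join(sorted(missing)))
    return transfer_vectors
```

Because the table is filled once and only consulted at program load,
the cost of resolving a symbol is paid per program, not per call.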
Many functions traditionally found in the kernel are implemented
in the global libraries. For example, the entire device-independent 
I/O layer (i.e., open, close, read, write) is in a user-space global 
library. 



Private Libraries 

In addition to being marked as shared in the system configuration 
file, private libraries can be marked as either static or dynamic. 
Marking a library static means that the library is loaded when a 
program containing a reference to the library is loaded; marking it 
dynamic means that the library is loaded when a program actually 
executes a call on the library. 
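
The difference between the two markings is one of timing, which the
following sketch tries to make concrete. It is a model under stated
assumptions: the Library class, its loader callable, and the
load_program_libraries function are hypothetical, and a "library" is
just a dictionary of Python functions.

```python
class Library:
    def __init__(self, name, loader, dynamic=False):
        self.name = name
        self._loader = loader      # callable returning {entry_name: function}
        self.dynamic = dynamic
        self._entries = None       # None until the library is loaded

    @property
    def loaded(self):
        return self._entries is not None

    def load(self):
        if self._entries is None:
            self._entries = self._loader()

    def call(self, entry, *args):
        self.load()                # a dynamic library is loaded on first call
        return self._entries[entry](*args)


def load_program_libraries(libraries):
    """Program load: static libraries are loaded now, dynamic ones are deferred."""
    for lib in libraries:
        if not lib.dynamic:
            lib.load()
```

A program that never calls into a dynamic library never pays for
loading it.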



The configuration file also allows you to designate libraries as op- 
tional; no error is reported if the library is not found. A default set 
of libraries is always loaded if the configuration file is not present. 
Support for Large Virtual Address Space Processes

Traditionally, UNIX systems require an area of the disk or disks to 
be preallocated as swap space. This space can be used only for 
swapping or paging, and cannot hold regular file system objects. 
The sum of the sizes of the virtual address spaces of all current 
processes must not exceed the amount of swap space. Thus, for 
example, if 8 megabytes of swap space are allocated, no process (or 
combination of active processes) can use more than 8 megabytes of 
virtual memory. 

If the preconfigured swap space is too small, the system must be 
reconfigured. That process generally requires dumping all the file 
system objects on that physical volume to tape, rebuilding all the 
file systems, and then restoring the data from tape. 

With the advent of processors that support huge virtual address 
spaces and applications that use them, these limitations are clearly 
no longer practical. They impose arbitrary limits on process virtual 
address space size and are an inefficient use of disk space. 

Domain/OS allocates disk space for backing store dynamically, as it
is requested by a process. Disk space does not have to be designated
in advance as file space or swap space. Thus, the size of a
process's address space is limited only by what the processor can 
support and by the total amount of free space on the logical vol- 
ume. This means, for example, that when a process performs an 
sbrk() operation to increase the size of its data area, it can obtain 
the space needed to back up the newly added pages of address 
space from anywhere on the file system volume. 
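
A small model makes the contrast concrete. The two classes below are
hypothetical and ignore everything except space accounting: the first
reserves a fixed swap region as in the 8-megabyte example above, the
second draws backing store from the volume's general free space. The
100-megabyte volume size is an assumption chosen for illustration.

```python
MB = 1024 * 1024


class PartitionedVolume:
    """Traditional layout: a fixed swap region, separate from file space."""

    def __init__(self, total_bytes, swap_bytes):
        self.swap_free = swap_bytes
        self.file_free = total_bytes - swap_bytes

    def grow_process(self, nbytes):
        """Back nbytes of new address space; only swap space may be used."""
        if nbytes > self.swap_free:
            return False
        self.swap_free -= nbytes
        return True


class UnifiedVolume:
    """Dynamic layout: backing store comes from the volume's free space."""

    def __init__(self, total_bytes):
        self.free = total_bytes

    def grow_process(self, nbytes):
        if nbytes > self.free:
            return False
        self.free -= nbytes
        return True
```

With 8 megabytes of swap on a 100-megabyte volume, a 6-megabyte
request followed by a 4-megabyte request fails on the partitioned
volume even though 92 megabytes of file space sit idle; both requests
succeed on the unified volume.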
More Efficient Use of Physical Memory

Another problem related to the size of a process's virtual address
space is that current UNIX kernels generally require that all the
page tables for a process reside in physical memory while the proc- 
ess is active. This represents an inefficient use of physical memory, 
and adds another unnecessary constraint on process virtual address 
space size. 



Domain/OS can swap the page tables for an active process out to 
the disk. This means that there can be a fixed upper limit on the 
amount of physical memory devoted to page tables, with no effect 
on the amount of virtual address space a process can use. If all of 
the currently active processes have relatively modest combined vir- 
tual memory requirements, then all of their page tables can fit in 
memory. In that case, no swapping takes place, and there is no 
impact on performance. 
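
To get a feel for the numbers involved, the following sketch computes
page table sizes under assumed parameters. The 4096-byte page size,
the 4-byte page table entry, and the flat (single-level) table are
assumptions chosen for illustration, not Domain/OS values, and the
function names are hypothetical.

```python
def page_table_bytes(virtual_bytes, page_size=4096, entry_size=4):
    """Bytes of page table needed to map virtual_bytes with a flat table."""
    pages = -(-virtual_bytes // page_size)    # ceiling division
    return pages * entry_size


def resident_page_table_bytes(process_sizes, cap_bytes):
    """Physical memory spent on page tables when the total is capped.

    Returns (resident_bytes, must_swap). Tables beyond the cap live on disk.
    """
    needed = sum(page_table_bytes(size) for size in process_sizes)
    return min(needed, cap_bytes), needed > cap_bytes
```

Under these assumptions a 256-megabyte address space needs 256
kilobytes of page table, and a 4-gigabyte address space needs 4
megabytes. With a 1-megabyte cap, the first case stays entirely in
memory while the second forces some page tables out to disk.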
Single-Level Store

Some operating systems divide storage into several levels. A ma- 
chine's main memory acts as the primary storage level, while the 
disk acts as secondary storage. In this scheme, programs have direct 
access only to the primary storage level; they must explicitly copy 
data from secondary to primary storage before they can operate on 
it. 



Domain/OS uses a single-level storage mechanism, whereby a pro- 
gram gains access to an object by mapping object pages directly into 
the process's address space. With the single-level store, all objects 
in the network are accessed in the same way, regardless of whether 
they reside on the local disk or on another disk in the network. 
Users can share the same program and/or data file, and can exe- 
cute a program without regard for the location of files that it uses. 

Both disk and network I/O are implemented by way of demand 
paging. The file I/O manager maps file pages into the virtual mem- 
ory of a process. When a program attempts a read or write opera- 
tion, the file I/O manager starts at the current seek pointer location 
and copies the pertinent data from the place in the process's ad- 
dress space where the file has been mapped to the user's program 
buffer. (A program can also be set to perform I/O in a mode that 
eliminates the data-copying step.) If the data requested is not in 
real memory, a page fault occurs. 
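
The read path described above can be imitated on any system that
offers memory-mapped files. The sketch below uses Python's standard
mmap module as a stand-in for the Domain/OS mapping mechanism; the
MappedFile class and the demonstration function are hypothetical. A
read copies bytes from the mapped region, starting at the seek
pointer, into a buffer returned to the caller, and the host operating
system pages the data in on first touch.

```python
import mmap
import os
import tempfile


class MappedFile:
    """read() and seek() implemented by copying out of a mapped file."""

    def __init__(self, path):
        self._f = open(path, "rb")
        self._map = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        self._pos = 0                          # the seek pointer

    def seek(self, offset):
        self._pos = max(0, min(offset, len(self._map)))

    def read(self, nbytes):
        end = min(self._pos + nbytes, len(self._map))
        data = self._map[self._pos:end]        # the copy into the caller's buffer
        self._pos = end
        return data

    def close(self):
        self._map.close()
        self._f.close()


def read_via_mapping_demo():
    """Write a small file, then read parts of it through the mapping."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"hello, single-level store")
        mf = MappedFile(path)
        first = mf.read(5)
        mf.seek(7)
        second = mf.read(6)
        mf.close()
        return first, second
    finally:
        os.remove(path)
```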

Thus, there is a direct mapping between object pages (regardless of 
where they reside on the network) and process virtual address 
space. With this direct mapping feature, processes can access ob- 
jects using programming language variables, arrays, strings, and 
other constructs. In addition, once the object is mapped into a 
process's virtual address space, the system does not demand-page 
any data until the process actually refers to it. Thus, processes can 
map the objects without excessive system overhead. 
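
Direct access of this kind can be approximated with the standard
Python mmap and struct modules, used here as stand-ins for the
Domain/OS mapping interface, which this manual does not show. In the
sketch, a file is treated as an array of 32-bit integers and one
element is updated in place with no explicit read or write call. The
function names and the record layout are assumptions made for the
example.

```python
import mmap
import os
import struct
import tempfile


def increment_record(path, index):
    """Add 1 to the index-th little-endian 32-bit integer, through the mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            offset = index * 4
            (value,) = struct.unpack_from("<i", m, offset)
            struct.pack_into("<i", m, offset, value + 1)
            m.flush()                      # push the modified page to the file


def update_in_place_demo():
    """Create a three-integer file, bump the middle one, and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack("<3i", 10, 20, 30))
        increment_record(path, 1)
        with open(path, "rb") as f:
            return struct.unpack("<3i", f.read())
    finally:
        os.remove(path)
```

Only the page that holds the touched integer has to be brought into
memory, which is the behavior the paragraph above describes.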

Domain/OS makes more efficient use of physical memory by allow- 
ing all of it to be available as a cache over the file system. The 
system uses the same mapping and demand paging mechanism for 
program execution, as well. Because the demand paging mechanism 
operates transparently over the network, you can execute programs 
on nodes with or without disks, without additional special
mechanism.

Of course, Domain/OS provides the standard UNIX I/O interfaces:
open, close, read, write, and seek. In addition, Domain/OS allows
direct access to the mapped file interface for applications that
require the maximum possible I/O throughput.



Areas

One exception to the single-level store mechanism lets Domain/OS 
create processes faster and less expensively. Instead of using a file 
system object to back up portions of virtual address space, we use a 
pseudo-object called an area. The area mechanism is based on the 
System V regions model, but since that term has another meaning 
specific to Domain/OS, we refer to this pseudo-object as an area. 

An area is a set of contiguous segments in a process's virtual ad- 
dress space. The process itself creates and maps the area. Other 
processes can map any portion of an area into their own address 
spaces. 

An area has a unique identifier (UID) by which it can be mapped,
just as an object can. It appears to the rest of the operating system
as a file system object, but without the overhead of creating,
deleting, and manipulating a file system object.

Areas let you avoid manipulating objects directly. Since all alloca- 
tion, deallocation, and growth operations are performed in mem- 
ory, there are no file maps to adjust and thus no disk buffering or 
other disk operations to perform. 
Virtual Memory Management

The Domain/OS virtual memory management scheme provides net- 
work-wide access to objects with two related operations: mapping 
and demand paging. 



Mapping — the Single-Level Store 

We discussed mapping in a previous section on the single-level 
store. Briefly, the single-level store allows a program to gain access 
to a file system object by mapping the object's pages into the proc- 
ess's address space. Once the object is mapped in, individual seg- 
ments of it are moved in and out of the address space by a mecha- 
nism called demand paging. 



Demand Paging 



In demand paging, the system dynamically transfers pages of an 
object in and out of physical memory, both locally and over the 
network. The object may reside on the local disk or on a remote 
node's disk. Each node has a remote paging server process to han- 
dle remote requests to read and/or write pages of objects that reside 
on that node. When another node references an object belonging 
to that node, the paging server dynamically transfers the data to the 
requesting node. 

The paging system on a node caches copies of pages that have been 
transferred to that node, so that any subsequent reference to the 
same page is very fast. A concurrency checking mechanism ensures 
that the cached pages are valid upon subsequent reference. 

In Domain/OS, both disk and network I/O are implemented by way 
of demand paging. The file I/O manager maps file pages into the 
virtual memory of a process. When a program attempts a read or 
write operation, the file I/O manager copies the pertinent data, 
starting at the current seek pointer location, from the place in the 
process's address space where the file has been mapped, to the 
user's program buffer. If the data requested is not in real memory, 
a page fault occurs. 
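
The read path just described can be sketched with Python's standard mmap module standing in for the Domain/OS mapping primitive. The class name MappedFile and its methods are hypothetical, chosen only for illustration; this is a minimal sketch of the idea, not the file I/O manager itself.

```python
import mmap
import os
import tempfile


class MappedFile:
    """Hypothetical read/seek layer built over a memory-mapped file."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_RDONLY)
        # Map the whole file; pages are brought in only when touched.
        self._map = mmap.mmap(self._fd, 0, access=mmap.ACCESS_READ)
        self._pos = 0  # current seek pointer

    def seek(self, offset):
        self._pos = max(0, min(offset, len(self._map)))

    def read(self, nbytes):
        # Copy from the mapped region into a fresh buffer for the caller.
        data = self._map[self._pos:self._pos + nbytes]
        self._pos += len(data)
        return data

    def close(self):
        self._map.close()
        os.close(self._fd)


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"single-level store")
    f = MappedFile(tmp.name)
    f.seek(7)
    first = f.read(5)
    rest = f.read(100)
    f.close()
    os.unlink(tmp.name)
    return first, rest
```

Here read is nothing more than a copy out of the mapped region; any page that is not resident is faulted in by the operating system during that copy, which is the behavior the paragraph above attributes to demand paging.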
Scaling to Many Machines and/or Users

Unlike traditional UNIX implementations, Domain/OS anticipated 
the availability of high-speed local area networks from the time of 
its initial design. We felt that to truly exploit the current and future 
capabilities of such networks, the distributed file system supported 
by the system would have to adhere to two principles: 

• The design must scale well to very large networks, consist- 
ing of thousands of nodes. 

• Sharing of data without prior arrangement must be the de- 
fault. 



Simple Administration in Large Networks 

The first principle implies that administering the distributed file sys- 
tem would have to be extremely simple. Adding a new node or a 
new user to the network must be as simple as adding a new user on 
a timesharing system. The ability to have new nodes join the distrib- 
uted file system must not be limited by fixed size mount tables, nor 
must other machines have to perform any explicit action to access a 
new node. 



In Domain/OS, when a new node joins the network, it issues only 
one command to make its presence known to a network-wide nam- 
ing server. Thereafter, other nodes automatically learn the address 
of the new node when they attempt to refer to it by name. 



One of the most important benefits of these design decisions is that 
diskless nodes can be used and administered very simply and flex- 
ibly. Any workstation can boot diskless off any disked workstation 
in the network at any time, without prior arrangement. 

Nor does a diskless workstation require its own file system partition 
on its partner. Instead, it shares the root volume of the partner. 
Like every other node, it sees the same view of the name space and 
has full access to the file system, no matter who its partner is. 
One other point concerning protection is worth noting. In
supporting very large user communities, one soon finds that the simple
owner-group-world protection scheme offered by the UNIX system 
is not always flexible enough to carefully control the sharing of 
data. One often wants to grant or remove rights for certain persons 
or groups of people. Or one wants to have the protection placed on 
a file be determined not by who is creating the file, but by what 
directory it is being created in. 

To address this problem, Domain/OS optionally allows the exten- 
sion of the UNIX protection mechanism with Access Control Lists 
[3]. ACLs are ordered lists of subjects (persons, groups, and or- 
ganizations) and the rights granted to those subjects. Protection in- 
heritance for newly created files can be specified on a per-directory 
basis, and can be set to be "from process" (traditional UNIX se- 
mantics) or "from directory," where each directory has associated 
with it the initial protections to be applied to new files. 
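
The two ideas in this paragraph, an ordered list of subjects with rights and a per-directory inheritance mode, can be sketched as follows. The wildcard character, the first-match evaluation rule, and every class and function name here are assumptions made for illustration; the text above does not specify them.

```python
from dataclasses import dataclass, field

WILDCARD = "%"  # assumed wildcard marker, for illustration only


@dataclass
class Entry:
    person: str
    group: str
    org: str
    rights: frozenset

    def matches(self, person, group, org):
        pairs = ((self.person, person), (self.group, group), (self.org, org))
        return all(want in (WILDCARD, have) for want, have in pairs)


def rights_for(acl, person, group, org):
    """Return the rights of the first matching entry (assumed rule)."""
    for entry in acl:
        if entry.matches(person, group, org):
            return entry.rights
    return frozenset()


@dataclass
class Directory:
    inherit: str = "from process"  # or "from directory"
    initial_acl: list = field(default_factory=list)

    def acl_for_new_file(self, process_acl):
        # "from directory": the directory dictates protection on new files.
        if self.inherit == "from directory":
            return list(self.initial_acl)
        # "from process": traditional UNIX behavior, the creator decides.
        return list(process_acl)
```

Because the list is ordered, a specific entry placed ahead of a general one takes precedence under this assumed rule, which is one way to grant or remove rights for a single person within a group.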

If a user or system administrator makes no special arrangements, 
the default is for the system to provide nothing more than standard 
UNIX protection semantics. If the extra flexibility of ACLs is 
needed, however, it can be used without modifying existing
UNIX programs to deal with the new protection information. In 
other words, ACLs interact well with such UNIX system calls as 
open, creat, stat, and chmod. 

In addition, rather than having a copy of an /etc/passwd file on 
every node, a unified server-based account registry covers an entire 
network [6]. Users therefore have a network-wide identity. Adding 
a new user simply involves sending a series of messages to the regis- 
try servers via an editing tool. No files need be copied to every node 
in the network to make the new user known. 



This illustrates another implication of the desire to support large 
networks: the only viable way to deal with network-wide databases 
and services is to get at them through servers (possibly replicated), 
and if necessary, to keep local caches of those databases on each 
node. It is not practical to distribute full copies of such databases to 
every node each time the databases change. 
Shared Data



The second design principle of the Domain/OS distributed file sys- 
tem, sharing by default, implies a global uniform name space. The 
name space of the distributed file system appears to users like that 
of a giant timesharing file system. It is a traditional UNIX
hierarchical name space, except that absolute pathnames can begin with
the name of the network root (called //). It is also possible to express
pathnames relative to the root of the local node (the / directory).

The network root is a database maintained by the network naming 
server. For efficiency, each node has its own cache of the network 
root. When the system tries to resolve a pathname beginning with 
//, it first looks in the local cache; if the node name is not found 
there, it consults the naming server. 
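
The lookup order described above (local cache first, naming server on a miss) might look like the following sketch. NamingServer, Node, and their methods are hypothetical stand-ins; node addresses are shown as plain integers purely for illustration.

```python
class NamingServer:
    """Stand-in for the network-wide naming server (hypothetical interface)."""

    def __init__(self):
        self._nodes = {}
        self.lookups = 0

    def register(self, name, address):
        self._nodes[name] = address

    def lookup(self, name):
        self.lookups += 1
        return self._nodes[name]  # KeyError if the node is unknown


class Node:
    def __init__(self, server):
        self._server = server
        self._root_cache = {}  # this node's cache of the network root

    def resolve(self, pathname):
        """Split //node/rest into (node address, path on that node)."""
        if not pathname.startswith("//"):
            raise ValueError("expected a pathname beginning with //")
        node_name, _, rest = pathname[2:].partition("/")
        if node_name not in self._root_cache:
            # Cache miss: consult the naming server, then remember the answer.
            self._root_cache[node_name] = self._server.lookup(node_name)
        return self._root_cache[node_name], "/" + rest
```

A second reference to the same node is answered from the cache, so the naming server is consulted only once per node name in this sketch.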



The important point is that no matter what node on the network a 
user sits at, a given file has the same pathname. In order to share 
public files, or other users' private files, no prior arrangements 
(such as mounting file systems) need be made. Access is controlled 
by the normal file system protection mechanisms, which apply net- 
work-wide. 



After using this facility for a while, the virtues of sharing become 
apparent. For example, it is routine within Apollo's 2500-node 
corporate network (spread over eight buildings and two states) for a 
user to send a mail message to a group of users that says "Please 
look at my proposal in //mynode/user/proposal, and record your 
comments in the file //mynode/user/comments." 

Recipients can then simply move a cursor onto the first pathname in 
the mail message, click a mouse button, and open an edit window 
displaying the contents of the file. They can then point at the sec- 
ond pathname, click, and edit the comments file. This ease of use 
greatly improves the efficiency and bandwidth of communications, 
especially among geographically dispersed groups. 
Extensibility by Users



An important aspect of an open architecture is the ability of system 
builders and users to extend the system's functionality. Such exten- 
sions represent as much of an investment as applications develop- 
ment. 



In traditional UNIX systems, functionality is extended by modifying 
the kernel. The most common kind of extension is the addition of 
device drivers for new devices. This requires a certain level of ex- 
pertise in kernel internals and, usually, a set of kernel source code. 



More seriously, there is no guarantee that when the vendor supplies 
the next release of the operating system, the user's extensions will 
still work. At the very least, any kernel changes will have to be 
reintegrated with the source and/or binaries provided by the ven- 
dor. 



Open System Toolkit 

Domain/OS solves this problem by its adherence to object-oriented 
structuring techniques. Each object in the file system is marked with 
a unique type identifier (type UID). I/O on each object type is han- 
dled by a different manager. When an object that is not one of the 
built-in types is opened, the device-independent I/O subsystem 
(the switch) locates the manager for that type, and dynamically 
loads it. It then calls a manager initialization routine which exports 
a vector of procedures implementing operations like open, close, 
read, write, and seek. 

If a manager supports other semantics, say those of a tty, it may 
export other sets of operations, such as those for setting erase and 
kill characters or for setting raw and cooked modes. Subsequently, 
all operations on the new file descriptor will be switched through to 
the manager's exported procedures. 
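
As a rough illustration of the switch, the sketch below keeps a table from type UID to manager initialization routine and routes each operation through the vector that routine exports. All names are hypothetical, the type UID is shown as a string for readability, and the circular log manager is a toy. Real managers are loaded dynamically, which this sketch does not attempt.

```python
class CircularLogManager:
    """Hypothetical manager for a user-defined 'circular log' type."""

    def __init__(self, capacity=3):
        self._lines = []
        self._capacity = capacity

    def write(self, line):
        self._lines.append(line)
        del self._lines[:-self._capacity]  # keep only the newest entries

    def read(self):
        return list(self._lines)


def init_circular_log():
    """Manager initialization routine: returns the exported procedure vector."""
    mgr = CircularLogManager()
    return {"write": mgr.write, "read": mgr.read}


class Switch:
    def __init__(self):
        self._initializers = {}  # type UID -> manager initialization routine

    def install(self, type_uid, initializer):
        self._initializers[type_uid] = initializer

    def open(self, type_uid):
        # Locate the manager for this type and obtain its operations.
        return self._initializers[type_uid]()


def call(vector, operation, *args):
    """Switch an operation through to the manager's exported procedure."""
    try:
        proc = vector[operation]
    except KeyError:
        raise NotImplementedError(operation) from None
    return proc(*args)
```

An application that only ever issues read and write through call needs no change when a new type is installed; an operation the manager does not export (seek, in this toy) simply fails.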

Apollo documents the interfaces expected by the switch and guar- 
antees that they'll remain the same from one software release to the 
next. These published interfaces and the tools for defining new 
types are known as the Open System Toolkit. Open System Toolkit 
managers run in user space, and therefore have all the
above-described facilities available to them for shared memory and
synchronization.

Developers can use the Open System Toolkit to add interesting new 
file types to the system, while applications that use these new types 
continue to work without change. A simple example might be a 
circular log file type. Another useful type might be one which main- 
tains all the versions of a text file under source version control. 
When an application opened a text file under version control, it 
would read the most recent version of the text. This obviates the 
need to perform a separate "fetch" operation before an application 
can look at a source module. The GPIO system can also combine 
with the Open System Toolkit to give standard access to new de- 
vices. 



Objects 

An object is a container, and each object has a unique identifier 
called a xoid associated with it. The operating system does not care 
what the contents of an object are. Objects that the system manipu- 
lates can "contain" such diverse things as ASCII files, printers, and 
tape drives. 



Applications programs generally want to perform certain kinds of 
functions on objects, like reading, writing, creating, and deleting. 
Traditionally, in the case of peripherals, an operating systems pro- 
grammer would write a device driver for a new class of device that 
the system would support, add the driver to the operating system 
source code, and then rebuild the system software. 

In Domain/OS, subroutine libraries supplied with the operating sys- 
tem perform basic operations like reading and writing on their asso- 
ciated type of object. These subroutine libraries are called type 
managers. Each type manager supports certain traits. A trait is an 
ordered set of the operations that can be performed on an object. 

Users can add new types of objects and their associated managers 
with the Open System Toolkit. Rather than having to rebuild and 
reboot the operating system, the user can install a new type man- 
ager with a single shell command. 
Type managers can implement standard operations like read and
write in any way appropriate. New system functionality can be 
added without disturbing system operation. Thus, the new function- 
ality is immediately available to all programs. 

Not all functions need be implemented for every type manager, of 
course. For example, the type manager for a line printer would 
probably not implement a function to move a file pointer. 



Type Managers and Traits 

The subroutine library that implements the functions for a given object
type is called a type manager. In Domain/OS, there are managers 
for physical resources like disks, network controllers, and memory, 
as well as for abstract concepts like files, processes, and address 
spaces. 



Since an object is only a container, each type of object on the sys- 
tem looks the same to an applications program. Because of this, the 
application can contain, say, a read call that is compiled into the 
program. At execution time, the user can redirect the program and 
the call to operate on any kind of object. For example, if a program 
attempts to open an object of the ASCII file type, the ASCII file 
type manager performs the appropriate functions. Thus, developers 
can create a group of general-purpose utilities that operate on all 
object types instead of creating and maintaining programs for each 
individual type on the system. 

Much of this is true of traditional UNIX implementations, but few 
let you easily add new types to the system. As a result, in most 
UNIX implementations, users cannot be sure that a new type added 
at one release level of the operating system will still operate cor- 
rectly with a new release. Under Domain/OS, new types will work 
correctly with later operating system releases. 

A trait represents a certain behavior that an object supports. Each 
trait is an ordered set of operations. An object supports a trait if the 
object's type manager implements the operations that define the 
trait. For every trait that a type manager supports, the manager 
provides a list of pointers to procedures that implement the opera- 
tions in the trait. For further details on the trait/type system, see the 
paper An Extensible I/O System in Part 2 of this book. 
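
To make the definition concrete, here is a minimal sketch in which a trait is an ordered tuple of operation names and a manager supports the trait only if it supplies a procedure for each one. The trait names, operation names, and helper functions are all assumptions for illustration.

```python
# Trait names and operation lists below are illustrative assumptions.
TRAITS = {
    "io": ("open", "close", "read", "write", "seek"),
    "tty": ("set_erase", "set_kill", "set_raw", "set_cooked"),
}


def export_trait(trait, manager):
    """Build the ordered procedure list a manager provides for one trait.

    Returns None when the manager lacks any operation the trait defines.
    """
    procs = []
    for op in TRAITS[trait]:
        proc = getattr(manager, op, None)
        if not callable(proc):
            return None
        procs.append(proc)
    return tuple(procs)


def supports(trait, manager):
    return export_trait(trait, manager) is not None


class PlainFile:
    """Toy manager implementing only the 'io' operations."""

    def open(self): ...
    def close(self): ...
    def read(self): ...
    def write(self, data): ...
    def seek(self, offset): ...
```

Keeping the operations in a fixed order lets a caller index into the exported list by position, which is one plausible reason for a trait to be an ordered set.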
How Type Managers Are Loaded

When an applications program written using the Open System 
Toolkit attempts an operation on a pathname, the system resolves 
the name of the object into a xoid and then ascertains the object 
type that the xoid identifies. If the manager for that type is not 
currently loaded, the system loads it from the trait/type database 
(which tracks which type managers are currently loaded) into the 
address space of the process that requested the object. 



A type manager is loaded only when a process demands it. Unlike a 
device driver, the manager does not need to be bound into the 
operating system in advance. Thus, a type manager doesn't con- 
sume system resources until it actually is loaded. 



Extended Naming 



One of the traits that a type manager may optionally support is 
known as extended naming. When the system name resolver en- 
counters a pathname component that is not a standard directory 
(called an "extended naming object") but unresolved pathname 
text still remains, the rest of the text is passed to the open operation 
of the type manager that supports the extended naming object type. 
The interpretation of the residual text depends entirely on how the 
manager is written. 

For example, a manager to control source versions could be written 
in such a way that trying to access the pathname /progs/main.c
would give the application the most current version of the source
file for editing and compiling, while specifying the pathname
/progs/main.c/7 would provide version number 7, and naming
/progs/main.c/-1 would provide the penultimate version.
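
The source version example can be sketched as below. The resolver walks ordinary directories (modeled as dictionaries) and, on meeting anything else, hands the unresolved remainder of the pathname to that object's open operation. VersionManager, resolve, and the way negative numbers are interpreted are assumptions for illustration only.

```python
class VersionManager:
    """Hypothetical extended-naming manager for a versioned source file."""

    def __init__(self, versions):
        self._versions = list(versions)  # oldest first; version 1 at index 0

    def open(self, residual):
        if residual == "":
            return self._versions[-1]  # most current version
        n = int(residual)
        if n > 0:
            index = n - 1  # absolute version number
        else:
            index = len(self._versions) - 1 + n  # -1 is the penultimate
        if not 0 <= index < len(self._versions):
            raise FileNotFoundError(residual)
        return self._versions[index]


def resolve(namespace, pathname):
    """Walk pathname; pass leftover text to an extended naming object."""
    node = namespace
    parts = [p for p in pathname.split("/") if p]
    for i, part in enumerate(parts):
        node = node[part]
        if not isinstance(node, dict):  # not a standard directory
            residual = "/".join(parts[i + 1:])
            return node.open(residual)
    return node
```

The resolver itself knows nothing about version numbers; all interpretation of the residual text lives in the manager, as the section describes.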

A more typical use of extended naming is to provide gateways to 
non-Apollo file systems. For example, a Network File System 
(NFS*) mount point could be an extended naming object sup- 
ported by the NFS type manager. All pathname text beyond the 
mount point would be interpreted by the NFS type manager as be- 
ing relative to the root of the remote file system mounted at that 
point. 



* NFS is a trademark of Sun Microsystems, Inc.



This mechanism provides a relatively simple way for users to build 
gateways to any kind of foreign file system to which they might want 
transparent access. All that is necessary is to write the type manager 
and a server to run on the foreign machine to handle remote file 
system requests. 



Network Computing System 

Another important way in which users can extend the functionality 
of the system is via the Network Computing Architecture (NCA).
The Network Computing Architecture is a framework for develop- 
ing distributed applications. It allows optimal use of computing re- 
sources and even allows users to take advantage of parallel process- 
ing and specialized hardware. It also provides a way for machines to 
advertise idle computing facilities. 

The Network Computing System (NCS) is a portable implementa- 
tion of the Network Computing Architecture that runs on UNIX 
systems and other systems. It provides tools, servers, and informa- 
tion brokers to develop distributed applications. NCS extends the 
Domain/OS concept of objects to include replicated objects — cop- 
ies of a single object, all with the same unique identifier. These 
replicas are weakly consistent, that is, all copies of an object may 
not always be in an identical state. However, a weakly consistent 
replica is more likely to be available than a strongly consistent rep- 
lica (one of many copies guaranteed to be identical). 

NCS consists of two major pieces, remote procedure calls and lo- 
cation brokers. Remote procedure calls operate in programs as lo- 
cal procedure calls. However, they allow a process on one machine 
access to data, programs, or devices on another machine, by caus- 
ing a server on the remote machine to execute subroutines on the 
program's behalf. Remote procedure calls enable an application to 
be run in a distributed fashion, without a programmer having to 
rewrite the application. 

In a network where considerable distributed processing occurs, ap- 
plications must have access to information about available machines 
and CPU cycles. Location Brokers store and administer this type of 
information, allowing NCS applications to bind dynamically to dif- 
ferent services without being rewritten. Location Brokers can be 
replicated easily, to provide reliability. Many services augment the 
NCS tools, including license servers that provide more flexible
software licensing.

NCS is portable and allows system software and applications to ob- 
tain services that are implemented on both Apollo's and other 
manufacturers' machines. Because these services may be impossible 
or expensive to provide locally, NCS truly extends the power of the 
local system. 

In addition, NCS increases homogeneity in system administration. 
For example, Apollo uses a distributed registry — a database that 
holds all user account and group and organization information. This 
distributed registry is portable to any UNIX system. Thus, a mixed- 
vendor network can handle user accounts under a single unified 
mechanism. 



Support for Multiple Operating System 
Environments 

The state of affairs in the UNIX world today is such that there are 
really two de facto standards: AT&T's System V and Berkeley's 
VMUNIX. Until a single standard is accepted by an overwhelming 
portion of our customer base, Domain/OS will include and support 
both variants. 



We view Domain/OS as a single operating system with a very rich 
set of system services (primitives). Some of these primitives are 
found in other UNIX systems (system calls and library routines). 
Other interfaces are unique to Domain/OS. 

Programmers are free to use whatever primitives they like, but pro- 
grams designed to be portable to other UNIX systems should re- 
strict themselves to the standard UNIX interfaces. Programs that 
use only the proprietary Domain/OS primitives are sometimes 
known as "Aegis programs," mainly because they use calls that 
have historically been documented in the Aegis reference manuals. 

In order to provide maximum portability of software from Domain/OS
to other Berkeley or System V UNIX systems, Apollo provides two
complete and separate UNIX environments, rather than a hybrid of the
two. Any workstation can have one or both UNIX environments
installed, and users can select which environment to use on a
per-process basis.

Two key mechanisms support this facility. First, every program can 
have a stamp applied that says what UNIX environment it should 
run in. The default value for this stamp is the environment in which 
it was compiled. 



When the program is loaded, the system sets an internal run-time 
switch to either berkeley or att, depending on the value of the 
stamp. Some of the UNIX system calls use this run-time switch to 
resolve conflicts when the same system call has different semantics 
in the two environments. 



The other mechanism is a modification of the pathname resolution 
process, such that pathname text can contain environment variable
expansions. For example, the pathname /tmp/$(ABC) would expand
to /tmp/Ex00324, if the environment variable ABC had the value 
Ex00324 in the current process. 

When UNIX software is installed on a node, the standard trees 
(/bin, /usr) are installed under directories called bsd4.3 and 
sys5.3. The names /bin and /usr are actually symbolic links de- 
pendent on the value of an environment variable named SYSTYPE. 
That is, /bin is a symbolic link to /$(SYSTYPE)/bin. When the
program loader loads a stamped program, it sets the value of SYSTYPE
to either bsd4.3 or sys5.3, according to the value of the program 
stamp. Therefore, a program that refers to a name in one of the 
standard directories will access the correct version for its environ- 
ment. 
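
The two mechanisms combine roughly as follows. This is a sketch only: the function names, the regular expression, and the symbolic link table are illustrative assumptions, while the stamp values (berkeley, att) and SYSTYPE values (bsd4.3, sys5.3) are taken from the text.

```python
import re

_VAR = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand(pathname, env):
    """Replace each $(NAME) in pathname with env[NAME]."""
    return _VAR.sub(lambda m: env[m.group(1)], pathname)


# Mapping from program stamp to the SYSTYPE value set by the loader.
STAMP_TO_SYSTYPE = {"berkeley": "bsd4.3", "att": "sys5.3"}

# Assumed link table: the standard trees point at environment-specific copies.
SYMLINKS = {"/bin": "/$(SYSTYPE)/bin", "/usr": "/$(SYSTYPE)/usr"}


def resolve(pathname, stamp):
    """Follow a standard-tree link, then expand variables for this program."""
    env = {"SYSTYPE": STAMP_TO_SYSTYPE[stamp]}
    for link, target in SYMLINKS.items():
        if pathname == link or pathname.startswith(link + "/"):
            pathname = target + pathname[len(link):]
            break
    return expand(pathname, env)
```

The same name /bin/ls therefore reaches a different file depending only on the stamp of the program that uses it, with no change to the program.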
Support for Multiple Processors

The new generation of workstations includes multiple tightly cou- 
pled CPUs in one system. However, the UNIX system was originally 
designed to run on single processors. Jans points out [4] that a 
major problem is not just to fix the way critical sections in the 
UNIX kernel are protected for a multiprocessor, but simply to iden- 
tify all the critical regions. This is a problem because a UNIX kernel 
process assumes that other kernel code will not refer to kernel data 
structures unless the kernel process explicitly gives up the processor 
through a call to sleep(), or unless an interrupt occurs.

Single-processor UNIX systems handle the interrupt case by tem- 
porarily raising the processor priority to a high enough level to pre- 
vent any interrupt handlers from altering a critical data structure. In 
a multiprocessor system this is not sufficient, since multiple kernel 
processes could be running simultaneously. Therefore, access to 
data structures inside critical sections must be explicitly synchro- 
nized with a mechanism like semaphores or eventcounts. 

In Domain/OS, eventcounts form the basis of synchronization 
among processes. Every critical section is protected by a mutual 
exclusion lock/unlock pair. When a process reaches a critical sec- 
tion, it must be able to acquire the lock before it can continue. 
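
One well-known way to build a mutual exclusion lock from an eventcount is the ticket construction sketched below: each process draws a ticket, waits until the eventcount reaches that ticket, and advances the eventcount on unlock. This is an assumption about how such a lock could be built, not a description of the Domain/OS kernel code. A Python lock stands in for the atomic fetch-and-add a real implementation would use to hand out tickets.

```python
import threading


class EventCount:
    """Monotonic counter supporting read, advance, and await."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def read(self):
        with self._cond:
            return self._value

    def advance(self):
        with self._cond:
            self._value += 1
            self._cond.notify_all()

    def await_value(self, target):
        with self._cond:
            self._cond.wait_for(lambda: self._value >= target)


class TicketLock:
    """Mutual exclusion built from a ticket sequencer and an eventcount."""

    def __init__(self):
        self._next_ticket = 0
        self._ticket_guard = threading.Lock()  # emulates atomic fetch-and-add
        self._served = EventCount()

    def lock(self):
        with self._ticket_guard:
            ticket = self._next_ticket
            self._next_ticket += 1
        self._served.await_value(ticket)  # wait for our turn

    def unlock(self):
        self._served.advance()  # admit the next ticket holder


def run(workers=4, rounds=200):
    lock, total = TicketLock(), [0]

    def work():
        for _ in range(rounds):
            lock.lock()
            total[0] += 1  # critical section
            lock.unlock()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Because tickets are served in order, waiters enter the critical section first come, first served.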
Conclusions



Domain/OS is designed to meet the challenges offered by today's 
technology, and to solve the problems faced by today's users and 
system developers. 

Foremost among these challenges is the need to share computing 
resources of diverse origins as seamlessly as possible. Meeting this 
need implies strong support for existing standards and pushing be- 
yond the standards where they fall short of users' needs. At the 
same time, part of the Domain/OS philosophy is to take advantage 
of the inherent homogeneity among Apollo systems to provide high 
levels of performance and transparency. A prime example of this is 
the Apollo distributed file system with its uniform name space and 
simple administration. 

An important goal of Domain/OS is to provide flexibility for future 
changes and expansion, both by Apollo and by its customers. To 
this end, the system employs the principles of object-oriented de- 
sign, in which pieces of functionality can be replaced with little or 
no impact on other portions of the system. 

To provide the openness necessary for users to adapt the system to 
their own needs, the same mechanisms used internally to structure 
the operating system are all made available to users. 

In these ways, we expect that Domain/OS will prove to be a supe- 
rior base for meeting the challenges of the next several years. 
Extensions to UNIX Signal Functionality for
Modern Architectures 

by 
Dave McCracken 



Abstract 



While signals have long been a useful feature in UNIX systems, 
some modern innovations, such as shared global libraries and user 
space shared semaphores, have made the current implementation 
inadequate. In this paper we look at two extensions to the signal 
mechanism that provide the extra functionality required. 

A simple user space inhibit/enable mechanism provides cheap pro- 
tection from signals for critical sections of code, allowing some code 
that is only in the kernel for this protection to run in a shared li- 
brary. 

A nested cleanup handler allows reliable cleanup of semaphores in 
shared memory, as well as other resources the kernel may not oth- 
erwise be able to restore on process death. 



Introduction 



Signals were originally developed as a way of prematurely terminat- 
ing a process, either because of some fault the process generated or 
by some external event. After the ability to trap these signals was 
provided, the meaning expanded to include other events, such as 
timers, suspend/resume mechanisms, and software I/O interrupts. 



Copyright © 1988 Apollo Computer, Inc. Unpublished, all 
Some of the flaws in the early implementations, such as the inability
to protect critical sections of code without missing the signal, were 
addressed by Berkeley and, later, by AT&T with the ability to delay 
a signal's delivery. 

New architectures now gaining popularity are again making the ex- 
isting signal model inadequate. With the introduction of sema- 
phores in shared memory comes the need for cheap concurrency 
control, along with the need to clean up the locks if a fault occurs. 
Global shared libraries are taking over some of the functionality 
previously found in the kernel, which requires protection from 
asynchronous events. 

In this paper we examine some of the solutions implemented as part 
of the Apollo Domain/OS implementation of a UNIX operating sys- 
tem. Cheap nested cleanup handlers solve the semaphore problem, 
and a fast inhibit/enable mechanism protects the libraries during 
critical code. 



Global Shared Libraries 



The Problem 

In Apollo's Domain/OS, the use of global shared libraries has al- 
lowed some of the functionality traditionally found in the kernel to 
be moved into user space. In fact, a large portion of the kernel 
commonly considered part of the "system call" environment has
been moved to libraries, with more primitive calls into the kernel 
for things that really require the additional privileges the kernel pro- 
vides. 

A problem with this model is that, while this code does not really 
require kernel protection and privileges, it does require protection 
from asynchronous interrupts, or signals. While this can be pro- 
vided by a kernel primitive, in many places performance is also 
important and another system call, with its attendant overhead, is 
too expensive. 
The Solution



One area that has benefited from moving functionality into libraries 
is the UNIX signal subsystem. While parts of the delivery mecha- 
nism must be in the kernel, particularly the decision whether to 
interrupt the target at all, much of the user-supplied handler infor- 
mation can be stored in user space and only a subset of information 
forwarded to the kernel portion so it can make its initial delivery 
decision. 

The two parts of the handler are tightly coupled and include a 
handshake mechanism, so the kernel handler does not consider the 
signal delivered until the library code acknowledges receipt of the 
signal. This occurs just before the user-specified handler is called. 

With this model, the problem of inhibiting the delivery of signals 
becomes simple to solve. We have created an inhibit counter in 
user space that may be incremented and decremented by calls to a 
pair of small library routines. Since there are no kernel traps in- 
volved, these calls are very fast, typically less than ten machine 
instructions. This allows even commonly used code to easily pro- 
tect its critical code sections without seriously degrading perform- 
ance. 

The inhibit counter is integrated into the signal handler at the point 
where the signal is delivered to the library handler in user space. 
The first operation performed in the handler is to check this 
counter. If it is non-zero, the handler sets a flag indicating a signal 
is pending and returns to the interrupted code without delivering 
the signal any further. Since the kernel handler is waiting for the 
library handler to acknowledge receipt of the signal, it is left pend- 
ing, and other signals are blocked until this condition is cleared. 

When the inhibit counter is decremented to zero and a signal is 
pending, the library requests a re-signal from the kernel handler. 
This time the check in the library handler will pass and the ac- 
knowledgment will be sent, allowing the kernel handler to mark 
that signal as delivered and no longer pending (see figure).
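
The handshake and the inhibit counter can be modeled with two small classes, one for each half of the handler. This is a toy simulation under stated assumptions: the class and method names are invented, the kernel tracks a single pending signal, and no real signals or traps are involved.

```python
class Kernel:
    """Toy kernel side: holds a signal pending until it is acknowledged."""

    def __init__(self, library):
        self.library = library
        self.pending = None

    def post(self, signo):
        self.pending = signo
        self.library.handler(self, signo)

    def acknowledge(self):
        self.pending = None  # the signal now counts as delivered

    def resignal(self):
        if self.pending is not None:
            self.library.handler(self, self.pending)


class SignalLibrary:
    """Toy user-space side holding the inhibit counter."""

    def __init__(self, user_handler):
        self.user_handler = user_handler
        self.inhibit_count = 0
        self.signal_pending = False

    def inhibit(self):
        self.inhibit_count += 1  # plain increment, no kernel trap

    def enable(self, kernel):
        self.inhibit_count -= 1
        if self.inhibit_count == 0 and self.signal_pending:
            self.signal_pending = False
            kernel.resignal()

    def handler(self, kernel, signo):
        if self.inhibit_count != 0:  # first check: are signals inhibited?
            self.signal_pending = True
            return  # leave the signal pending in the kernel
        kernel.acknowledge()  # handshake, just before the user handler
        self.user_handler(signo)
```

Nested inhibit calls are handled by the counter: the signal is released only when the outermost enable brings the count back to zero.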
Semaphores in Shared Memory



The Problem 

Some form of handler is necessary to trap any fault or signal that 
the process may receive while the semaphore is locked. The signal 
handling mechanism provides some of this functionality, but falls 
down in several areas: 

• Each signal that may be received must have a handler set 
for it. 

• Cleanup needs to be done even for "uncatchable" signals, 
i.e., SIGKILL. 

• It does not allow nesting of semaphore locks in different 
sections of code with separate cleanup handlers. 

The Solution 

Apollo's solution to the problem of cleaning up user-space sema- 
phores, along with other important user state, is to allow the user to 
set dynamic cleanup handlers. These handlers are similar to 
setjmp, in that the user specifies the save area and that the return 
status indicates whether it was set or invoked. 

The cleanup handler differs from setjmp, however, in two impor- 
tant ways. First, these handlers are integrated into the signal han- 
dler and are automatically invoked whenever there is no user- 
specified handler for a signal that would cause process death. Sec- 
ond, the cleanup handlers can be nested. Setting a new handler 
when there is one already set behaves like a stack push, making the 
new handler the first one invoked. The previous handler is re- 
established as the current handler when the new handler is invoked 
or released. 

When a handler is invoked, it returns again from the "set" call, 
with the return status indicating what signal or error caused the in- 
vocation. The user program then has several options. If it was an 
error the routine was expecting, it can re-establish the cleanup han- 
dler and continue or return synchronously to its caller. Otherwise, 
the convention is for the routine to call a special routine that will
invoke the next handler in the chain, passing the status on. 

If a routine that set a cleanup handler completes normally, the han- 
dler must be explicitly released before the routine returns to its 
caller. 

An additional guarantee that all cleanup handlers are run is that the 
exit call invokes the cleanup handlers before the process is termi- 
nated. This ensures that any global machine state is reset no matter 
what path the process takes when it dies. 

To ensure that the cleanup handler chain is kept intact during a 
longjmp, all cleanup handlers set between the longjmp call and the 
code where the setjmp was done are invoked. A special status is 
used for these cleanup handlers to indicate that a longjmp is in
progress.
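The nesting and pass-it-on behavior can be sketched in portable C on top of setjmp and longjmp. This is a minimal model, not the Domain/OS implementation: cleanup_push, cleanup_signal, cleanup_release, the status codes, and the toy semaphore are all hypothetical names. Because setjmp must run in the frame of the routine that sets the handler, the "set" step here is a push followed by an explicit setjmp.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-ins for the cleanup facility described in the text. */
struct cleanup_rec {
    jmp_buf             env;    /* caller-supplied save area            */
    struct cleanup_rec *prev;   /* handler that was current before this */
};

static struct cleanup_rec *cleanup_top = NULL;
static int cleanup_status = 0;  /* status handed to the invoked handler */

static void cleanup_push(struct cleanup_rec *rec)
{
    rec->prev = cleanup_top;    /* behaves like a stack push */
    cleanup_top = rec;
}

/* Invoke the newest handler: pop it, which re-establishes the previous
   one, and resume at its set point. Does not return. */
static void cleanup_signal(int status)
{
    struct cleanup_rec *rec = cleanup_top;
    assert(rec != NULL && status != 0);
    cleanup_top = rec->prev;
    cleanup_status = status;
    longjmp(rec->env, 1);
}

/* Normal completion: the handler must be released explicitly. */
static void cleanup_release(struct cleanup_rec *rec)
{
    assert(cleanup_top == rec);
    cleanup_top = rec->prev;
}

#define ERR_BAD_ADDRESS 7       /* made-up status codes */
#define ERR_OTHER       9

static int sem_locked = 0;      /* toy semaphore */

/* Returns 0 on success, 1 if it absorbed the expected error. */
static int guarded_write(int fault)
{
    struct cleanup_rec rec;

    cleanup_push(&rec);
    if (setjmp(rec.env) != 0) {             /* handler was invoked */
        sem_locked = 0;                     /* cleanup: unlock     */
        if (cleanup_status != ERR_BAD_ADDRESS)
            cleanup_signal(cleanup_status); /* pass it on; no return */
        return 1;                           /* expected error */
    }
    sem_locked = 1;
    if (fault != 0)
        cleanup_signal(fault);              /* stands in for a fault */
    sem_locked = 0;
    cleanup_release(&rec);
    return 0;
}

/* Returns 0 on success, -1 if the inner routine absorbed the error,
   or the status that reached this outer handler. */
static int outer(int fault)
{
    struct cleanup_rec rec;

    cleanup_push(&rec);
    if (setjmp(rec.env) != 0)
        return cleanup_status;              /* already popped by the invoke */
    if (guarded_write(fault) == 1) {
        cleanup_release(&rec);
        return -1;
    }
    cleanup_release(&rec);
    return 0;
}
```

The inner handler always unlocks the semaphore, absorbs the error it expects, and forwards anything else to the outer handler, which mirrors the convention described above.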



Conclusion 

When new functionality is added to a UNIX system that interferes 
with the original signal model, we have shown that it is possible to 
extend that model in a compatible way to support the more com- 
plex interactions required by the new features. This issue will gain 
even more importance in the future as processes rely more on fea- 
tures that cannot allow uncontrolled signal interrupts or process 
death. 

Detailed Signal Delivery Sequence

• The signal is passed to the kernel with kill(). 

• If the signal is explicitly ignored or "ignore by default," it 
is dropped, else it is set pending. 

• As soon as the signal is not blocked, the target process is 
interrupted. 

• If a signal stack is specified, the user stack pointer is 
switched. 

• The kernel signal delivery pending flag is set. 

• The library signal handler is called. 

• If the inhibit count is non-zero, the library signal handler 
flags that a signal is pending and returns to the interrupted 
code. 

• When the enable is called that sets the inhibit count to 
zero, the kernel is notified and the pending signal is re-de- 
livered. 

• When the inhibit count is zero, the library handler ac- 
knowledges delivery of the signal, and the kernel signal de- 
livery pending flag is cleared, freeing the kernel handler to 
deliver another signal. 

• The user signal handler is checked and, if it is not default, 
it is called. When it returns, if it does, the signal is dis- 
missed and the process resumes where it was interrupted. 

• If there is no user signal handler, the first cleanup handler 
in the chain is invoked. 

Example of Using a Cleanup Handler



write_shared_memory(address, value, sem)
int *address;
int value;
semaphore *sem;
{
    pfm_$cleanup_rec cl_rec;
    status_$t status;

    /* trap any errors */
    status = pfm_$cleanup(cl_rec);
    if (status.all != pfm_$cleanup_set) {
        mclear(sem);

        /* check for bad address fault */
        if (status.all == mst_$illegal_address)
            return (1);         /* the write failed */
        pfm_$signal(status);    /* an anonymous problem, pass it on */
    }

    mset(sem, 1);               /* lock the semaphore */
    *address = value;           /* do the write */
    mclear(sem);                /* clear the semaphore */

    pfm_$rls_cleanup(cl_rec, status);

    return (0);                 /* the write succeeded */
}

Shared Program Libraries:
The Domain/OS Library Model 

by 
Bryan Douros 



Introduction 



At one time, it was necessary for each program to include every- 
thing it needed to solve a particular problem. The scope of prob- 
lems has expanded, however, and collections of library subroutines 
and standard system functions have allowed applications to grow 
into very complicated programs. Programmers draw on graphics, 
communications, database, and other services to make applications 
more usable, but increased functionality also makes them larger 
and more difficult to maintain. 

Modular programming practices and higher-level languages enable 
programmers to maintain their applications more easily. Collections 
of related routines are combined into libraries of services, so that 
many separate applications can now draw on the same services. In 
most systems, library and application routines are bound into a sin- 
gle program at link time. While this has extended what applications 
can do, there are several weaknesses to this scheme. Linking the 
same routine to many applications requires the duplication of the 
routine in each application program. This uses more disk space and 
increases the working set of the applications. 

This scheme causes distribution problems, too. No matter how 
modular an application is, bug fixes and performance improve- 
ments require relinking and redistributing all the programs that use 
these fixed routines. Distribution and maintenance become even 
more complicated when different libraries are managed by different 
groups, organizations, or even different companies. 

To overcome some of these weaknesses, Domain/OS provides the
ability to share libraries among programs with the following effects: 

• Linking to libraries can be deferred until program execu- 
tion. 

• Libraries can be distributed independently of application 
programs. 

• The same libraries can be shared with many concurrently 
executing programs. 

Domain/OS uses shared, direct paging of program code so as to 
only have one copy of the code in memory. It also uses indirect 
linkages to shared library references to minimize the relocation nec- 
essary at load time and position-independent code so that shared 
libraries can be loaded into any address space. Domain/OS also 
uses a table of known globals to make linking the correct routines 
easy and extensible. 



Indirect Linkages 

Domain/OS compilers generate linkages to shared library refer- 
ences by way of indirect pointers, so that read-only code doesn't 
require relocation. Procedure references call through transfer vec- 
tors (which are jump instructions) stored in read-writable sections 
to the shared library routine, and are relocated at load time. Data 
references use indirect references via an absolute pointer, also 
stored in read-writable sections and relocated at load time. This
allows direct paging of shared read-only program code and minimal 
relocation of linkage addresses at program loading time. 
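The idea can be illustrated with a small C model in which function pointers stand in for transfer vectors. This is a sketch only; the slot names, the table layout, and the library routines are assumptions made for illustration, not the compilers' actual output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: slot names and layout are assumptions. */
typedef int (*proc_t)(int);

/* "Shared library" routines and data; callers do not know their addresses. */
static int lib_double(int x) { return 2 * x; }
static int lib_negate(int x) { return -x; }
static int lib_counter = 40;

/* Read-writable linkage section: one slot per external procedure,
   plus an absolute pointer for the external data reference. */
enum { SLOT_DOUBLE, SLOT_NEGATE, SLOT_COUNT };
static proc_t transfer_vector[SLOT_COUNT];
static int   *data_pointer;

/* "Read-only" program code: it reaches the library only through the
   linkage slots, so nothing in it needs patching at load time. */
static int program(int x)
{
    *data_pointer += 2;
    return transfer_vector[SLOT_NEGATE](transfer_vector[SLOT_DOUBLE](x));
}

/* Load-time relocation touches the linkage section only. */
static void relocate_linkage(void)
{
    transfer_vector[SLOT_DOUBLE] = lib_double;
    transfer_vector[SLOT_NEGATE] = lib_negate;
    data_pointer = &lib_counter;
}
```

Only relocate_linkage writes addresses, which is why the code itself can be paged directly and shared.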



Position-Independent Code 

Domain/OS compilers are capable of generating position-independent
code (PIC), so that the bulk of a program may be loaded at any free
virtual address in the process's address space. Thus, a minimum of
relocation needs to take place at load time. The compilers use pro- 
gram counter (PC) relative branches and instructions, and take the 
indirect linkages to the extreme, so that all external linkages are 
indirect. This allows shared libraries to be loaded quickly into the
address space, relocating only the external linkages. 



Global Address Space 

Domain/OS divides the user address space of a process into two
parts: the first is private to the process, and the second is global,
appearing at the same virtual address in every process. This feature
allows shared libraries to be loaded once into
address space, after which every process has access to them at no 
cost. Important libraries that are used by every process are dealt 
with in this way. 

The private address space is user accessible and manageable. The
global space is managed by the process manager (PM), which loads
suitable shared libraries into the process's address space.



Known Global Table 

The known global table (KGT) is a system table that keeps track of 
the symbols exported by all the shared libraries known to processes. 
Each process maintains its own logical KGT, but by maintaining a 
global KGT (containing symbols known to all processes) and a pri- 
vate KGT (symbols added to this process), the system can maintain 
a symbol table that is small but can be completely customized. 

When a program with an unresolved external reference is loaded, 
the KGT is consulted and references that point to shared libraries 
are linked to those libraries. References to a shared library that has 
not yet been loaded either cause the library to be loaded and linked 
immediately, or cause it to be loaded when the program actually 
runs and attempts to call into that library. 

The KGT actually consists of three tables:

• The first table, the private KGT, contains symbol entries 
currently loaded into the private A space of the process. 
Symbols in this table are private to this process. Entries 
contain the ASCII name, the address of the symbol, and 
information as to the type of symbol (function, data, or 
common) . The table is indexed by a hash of the symbol 
name. 

• The second table, the known library table (KLT), is a table
of symbol entries that are in shared libraries. Symbols
in this table are private to this process. Entries contain
the ASCII name, a reference to the shared library
so that it can be loaded when needed, and a set of flags
indicating the behavior that this library should have (loaded
on program execution or loaded at program reference).
The KLT is inherited by child processes (created via fork,
exec, or pgm_$invoke). This is useful for building a set of
shared libraries to be used in the current environment.

• The third table, the global KGT, is a table of public symbol 
entries. These are symbols known to all processes. The 
global KGT contains all the symbols of libraries loaded 
into the global space and symbols of shared libraries that 
are to be loaded into the process's address space as 
needed. Entries contain a compressed form of the name
(32 bits), its type (function, data, or common), and either an
address of the symbol in global space or information similar
to that in the KLT that describes the library and when
it should be loaded. This table is also indexed by a hash of
the symbol name.

The global KGT is constructed at system boot time and is 
very large, since it contains all the symbols known to all 
processes. It can use a highly compressed form of the 
name, since it is built all at once and compressed symbol 
collisions are dealt with specially. This allows a table that 
can be searched very quickly to find symbol names. 

The KGT is searched in the following order: the private KGT, the 
KLT, then the global KGT. This allows private symbols to super- 
sede symbols defined in the public space to allow library develop- 
ment or private program tuning, while leaving the public libraries 
unchanged. The behavior of duplicate symbols in the same table is 
undefined. 
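The search order can be sketched as follows. The types and function names are hypothetical, and plain arrays searched linearly stand in for the hashed tables; only the precedence (private KGT, then KLT, then global KGT) is taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types; linear arrays stand in for the hashed tables. */
struct sym   { const char *name; int where; };
struct table { const struct sym *syms; size_t count; };

static const struct sym *table_find(const struct table *t, const char *name)
{
    size_t i;

    for (i = 0; i < t->count; i++)
        if (strcmp(t->syms[i].name, name) == 0)
            return &t->syms[i];
    return NULL;
}

/* Search order from the text: private KGT, then KLT, then global KGT.
   The first table that defines the name wins. */
static const struct sym *kgt_lookup(const struct table *private_kgt,
                                    const struct table *klt,
                                    const struct table *global_kgt,
                                    const char *name)
{
    const struct table *order[3];
    size_t i;

    order[0] = private_kgt;
    order[1] = klt;
    order[2] = global_kgt;
    for (i = 0; i < 3; i++) {
        const struct sym *s = table_find(order[i], name);
        if (s != NULL)
            return s;
    }
    return NULL;
}
```

A symbol defined in the private table shadows a definition of the same name in either of the later tables, without modifying them.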

Global Libraries



Global libraries are a special case of shared libraries. As described 
above, libraries loaded into global space are visible to all processes 
and have no cost at program load time. To accomplish this, Do- 
main/OS loads all global libraries at system boot time. 

Global libraries have four classes of storage: read-only procedure 
text, global storage (which gets turned into read-only after library 
initialization), dynamic storage (run-time stack), and process- 
private storage (which is zeroed at the beginning of each process) . 

Pure functions would be very easy to make into a global library. If a 
library has statically initialized state, it is more complex, but the 
benefits of global libraries are great and most Domain/OS libraries 
are global libraries. 



Dynamic Linking 

Domain/OS supports a limited form of dynamic linking. A program 
loaded with references to undefined procedures has a special jump 
vector created which points to a dynamic link snapper routine. 
(References to undefined procedures were either not found in the 
KGT or found and specified as load-on-reference.) 

During execution, a call to this undefined symbol is vectored to the
dynamic link snapper routine, which is passed the ASCII text of the
symbol name. The snapper looks for the symbol in the KGT. If the
referenced library is not loaded yet, it is loaded; the jump vector is
then patched to the symbol, and the routine is executed. Future
references have no extra performance cost.

If the symbol is not found, a fault is generated to the program. Dynamic
linking can be useful for programs that have many modes and don't always
execute all of their code, for it can defer the expense of loading 
shared libraries that might not be used at all. Unfortunately, this 
works only for procedure references and not for data references. 
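A rough C model of link snapping is shown below. It is a sketch under assumptions: the slot structure, snap_link, and call_via are hypothetical names, a NULL target stands in for "still points at the snapper," and an assertion stands in for the fault raised when the symbol is unknown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*proc_t)(int);

/* Hypothetical library routine and symbol table. */
static int lib_square(int x) { return x * x; }

struct entry { const char *name; proc_t addr; };
static const struct entry known_symbols[] = { { "lib_square", lib_square } };

static int snap_count = 0;      /* how many times a link was snapped */

/* A jump-vector slot: the symbol's name plus the current target.
   A NULL target models a slot that has not been snapped yet. */
struct jump_slot { const char *name; proc_t target; };

/* Look the name up and patch the slot; NULL if the symbol is unknown. */
static proc_t snap_link(struct jump_slot *slot)
{
    size_t i;

    for (i = 0; i < sizeof known_symbols / sizeof known_symbols[0]; i++) {
        if (strcmp(known_symbols[i].name, slot->name) == 0) {
            slot->target = known_symbols[i].addr;   /* patch the vector */
            snap_count++;
            return slot->target;
        }
    }
    return NULL;
}

/* Call through a slot, snapping it on first use only. */
static int call_via(struct jump_slot *slot, int arg)
{
    proc_t p = slot->target != NULL ? slot->target : snap_link(slot);

    assert(p != NULL);          /* stands in for the fault */
    return p(arg);
}
```

After the first call the slot holds the real address, so later calls skip the lookup entirely.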

System Configuration

Domain/OS reads the file /etc/sys.conf at system boot time to de- 
termine which shared libraries should be loaded globally, which 
should be loaded on program execution, and which should be 
loaded dynamically, at program call time. It contains entries of the
following form:

lib[rary] shared_library_pathname flags

The shared_library_pathname may either be relative to the /lib
directory or an absolute pathname. There may be zero or more flags,
separated by spaces or commas (,). The flags are global, dynamic, 
and optional. 

Global means that the library is to be loaded into global address 
space. Dynamic means that the library loading should be deferred 
until the actual execution of the routine utilizing the dynamic link- 
ing feature. Optional suppresses the error message if the library is 
not at the specified pathname. 

If global or dynamic is not specified, the library is loaded at pro- 
gram execution time. Because global address space is a limited re- 
source on some older Apollo workstations, we included the follow- 
ing flags for compatibility: 

not_16mb_va    do not load this global on a 16MB virtual address
               space machine (DN300, DSP80)

not_64mb_va    do not load this global on a 64MB virtual address
               space machine (DN330, DSP90, DN5x0, some DN3000)

These entries can be set by the workstation's administrator to
specify which shared libraries will be in the public KGT, the set of
symbols known to all processes. Changes to the sys.conf file
affect the public symbols only at the next system boot.
Global libraries must have no duplicate symbol definitions, and their
external references may be only to other global libraries.
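As an illustration of the entry format, the following C sketch parses one line of such a file. It is not the Domain/OS parser: the function name, the structure, and the flag constants are hypothetical, and the sketch assumes the keyword may be written either lib or library and that unknown flags are rejected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits; 0 means "load at program execution". */
enum {
    F_GLOBAL = 1, F_DYNAMIC = 2, F_OPTIONAL = 4,
    F_NOT_16MB_VA = 8, F_NOT_64MB_VA = 16
};

struct lib_entry {
    char     path[256];   /* absolute pathname of the shared library */
    unsigned flags;       /* OR of the F_* bits                      */
};

/* Parse one line; returns 0 on success, -1 if it is not a valid entry. */
static int parse_conf_line(const char *line, struct lib_entry *out)
{
    char buf[512];
    char *tok;

    if (strlen(line) >= sizeof buf)
        return -1;
    strcpy(buf, line);

    tok = strtok(buf, " \t\n");
    if (tok == NULL || (strcmp(tok, "lib") != 0 && strcmp(tok, "library") != 0))
        return -1;

    tok = strtok(NULL, " \t\n");
    if (tok == NULL || strlen(tok) + 5 >= sizeof out->path)
        return -1;
    if (tok[0] == '/')
        strcpy(out->path, tok);               /* absolute pathname  */
    else
        sprintf(out->path, "/lib/%s", tok);   /* relative to /lib   */

    out->flags = 0;
    /* Flags are separated by spaces or commas. */
    while ((tok = strtok(NULL, " ,\t\n")) != NULL) {
        if (strcmp(tok, "global") == 0)           out->flags |= F_GLOBAL;
        else if (strcmp(tok, "dynamic") == 0)     out->flags |= F_DYNAMIC;
        else if (strcmp(tok, "optional") == 0)    out->flags |= F_OPTIONAL;
        else if (strcmp(tok, "not_16mb_va") == 0) out->flags |= F_NOT_16MB_VA;
        else if (strcmp(tok, "not_64mb_va") == 0) out->flags |= F_NOT_64MB_VA;
        else return -1;                           /* unknown flag */
    }
    return 0;
}
```

An entry with neither global nor dynamic yields a flags value without those bits, which corresponds to loading at program execution time.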

Private Configuration



There are several methods whereby users can specify additional 
shared libraries. Shells distributed with Domain/OS have two added 
internal commands, inlib and llib. The inlib command adds a
shared library to the KLT for the current process. The llib com- 
mand lists the shared libraries in the current process's KLT. Cur- 
rently, all shared libraries added with inlib are loaded at program 
execution. 

Programmers can also specify at program link time what libraries 
are needed for the program. The /bin/ld and /com/bind tools have
options which allow you to include libraries. When the program is 
executed, those shared libraries are loaded with the program. 

The Domain/OS Input/Output System

by 
David A. Buckle 



Abstract 



While Domain/OS offers several novel features in the area of exten- 
sible I/O [3] and user-level device drivers [1], there is still a need 
for basic I/O services within the kernel. Continual development of 
new workstation platforms for Domain/OS requires continual en- 
hancement of the kernel I/O system to support these new plat- 
forms. 

This paper briefly describes the components of the kernel I/O sys- 
tem, and identifies the approaches taken to reduce development 
costs associated with new peripheral support. 



Introduction 



Domain/OS is the operating system kernel that runs on all Apollo 
workstations. It provides support for a variety of different peripher- 
als across the complete range: 

• Winchester Disk 

• Floppy Disk 

• Mass Storage Module 

• Apollo Token Ring 

• IBM Token Ring 

• ETHERNET*

• Cartridge Tape 

• Color and Monochrome Monitors 

One of the I/O system design goals was to simplify the development 
of new drivers. This has been achieved by reducing the interaction 
between driver modules and the other components of the operating 
system, and by providing common interface modules for the various 
classes of device present in the system. The next section describes 
this structure. 



Overall Structure of the I/O System 

The I/O system is implemented in several layers: 

• Device Class 

• Device Drivers 

• Resource Manager 

• Bus Interface 

The following figure shows the relationships among layers. 



* ETHERNET is a registered trademark of Xerox Corporation. 

Relationships of Layers in the I/O System

While lower levels normally provide services to the layers above, 
the organization is not strictly hierarchical; class modules can pro- 
vide device-independent routines for their drivers, and device driv- 
ers have access to the control and status registers (CSR) of the pe- 
ripheral controllers without calling resource level routines. 



Device Classes 



Domain/OS associates each device with one of a number of device 
class modules, based on the functions and use expected of the de- 
vice. 

All I/O requests from the upper layers of the operating system are
made through the class module, which then vectors these calls to 
the driver supporting the requested device. 

Each class provides a set of procedural interfaces to the upper lev- 
els of the operating system, tailored to the class, and not necessarily 
the same as the interfaces provided by other classes. This approach 
contrasts with the UNIX device model where all devices fall into 
one of two types, block and character, all block devices having one 
set of interface routines, and all character devices having another. 

Thus, for example, the network class can provide procedural entry 
points that reflect the functions expected of network devices, 
whereas UNIX network drivers are forced to map these network 
specific functions onto a generic entry point such as ioctl. 

The device class modules can also provide device independent rou- 
tines for use by the driver modules. For example, the disk class 
provides routines to sort queues of transfer requests, enabling disk 
drivers to easily manage head scheduling. 



Device Drivers 



All Domain/OS device drivers have the same basic structure: a set
of entry routines called by the class module and run in a process 
context, and one or more interrupt handlers that execute 
asynchronously. 

Driver entry routines are always called from the class module. Syn- 
chronization between the in-process driver routines and the 
asynchronously called interrupt handlers is accomplished via an 
eventcount mechanism [2] . Mutual exclusion locks are used to pro- 
tect common data structures. 



Resource Manager 



The I/O resource manager module provides a central place for the 
management of the hardware and software resources required by 
the device drivers. These resources include hardware interrupt vec- 
tors and software memory regions. 

The module provides services both to system devices and to the
Peripheral Bus Unit (PBU) support module, which in turn provides 
services to user-level GPIO device drivers. 

Most resources are acquired during system initialization rather than 
being statically reserved at system generation. Devices that are not 
present in the hardware configuration do not lock up any hardware 
resources, which are therefore available for user-level devices. This 
approach to resource allocation works well with the fixed configura- 
tion aspects of Domain/OS, and allows the same version of the op- 
erating system to be used on different hardware configurations with- 
out imposing restrictions on user-level drivers. 



Bus Interface 



Bus modules are provided for the various I/O buses present on 
Apollo hardware. All bus modules provide the same set of service 
routines. 



Initialization 



During system initialization, the I/O resource manager is called to 
perform I/O-specific initialization tasks. Associated with each de- 
vice controller in the system is a device descriptor data structure 
that describes the hardware characteristics of the controller, and 
contains a pointer to the supporting device driver initialization rou- 
tine. The I/O resource manager traverses the controller data struc- 
tures, and calls each initialization routine. 

Each driver checks for the presence of its controller, and if it is 
present, the driver registers itself with its known class module, and 
with the I/O resource manager. 

Class registration enables the driver to pass an entry point vector 
(EPV) to the class module. All calls from the class module into the 
device driver are made through this EPV, which contains pointers 
to all the exported procedures. 

Each device class is free to define the set of entry points most
relevant to the class, which then defines the formal interface between
the class and any driver supporting that class.

During the initialization phase, the driver also registers its interrupt
handlers with the I/O resource manager. Thus, the only entry point 
that needs to be directly exported from the driver module is the 
name of the initialization routine. 
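The initialization flow can be sketched in C as follows. This is an illustrative model only: the structure layouts, the registration function, and the toy driver are hypothetical, and the EPV shown carries just two entry points rather than a full class interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical disk-class EPV with two entry points. */
struct disk_epv {
    int (*open_proc)(int unit);
    int (*close_proc)(int unit);
};

#define MAX_DRIVERS 4
static const struct disk_epv *disk_drivers[MAX_DRIVERS];
static int disk_driver_count = 0;

/* Class registration: the driver hands its EPV to the class module. */
static int disk_class_register(const struct disk_epv *epv)
{
    if (disk_driver_count == MAX_DRIVERS)
        return -1;
    disk_drivers[disk_driver_count] = epv;
    return disk_driver_count++;
}

/* Class-level call from the upper layers, vectored through the EPV. */
static int disk_class_open(int driver, int unit)
{
    assert(driver >= 0 && driver < disk_driver_count);
    return disk_drivers[driver]->open_proc(unit);
}

/* A toy driver: only its init routine is referenced from outside. */
static int toy_open(int unit)  { return 100 + unit; }
static int toy_close(int unit) { (void)unit; return 0; }
static const struct disk_epv toy_epv = { toy_open, toy_close };

static int toy_controller_present = 1;

static void toy_init(void)
{
    if (toy_controller_present)     /* register only if hardware exists */
        (void)disk_class_register(&toy_epv);
}

/* Device descriptor: names the device and points at the init routine. */
struct device_descriptor {
    const char *name;
    void (*init)(void);
};

static const struct device_descriptor descriptors[] = {
    { "toy_disk", toy_init }
};

/* Resource manager: walk the descriptors and call each init routine. */
static void io_init(void)
{
    size_t i;

    for (i = 0; i < sizeof descriptors / sizeof descriptors[0]; i++)
        descriptors[i].init();
}
```

Because every call into the driver goes through the registered table, the descriptor needs to name nothing but the initialization routine.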



Configuration 



Configuration of device drivers within an operating system can be 
considered in three separate stages: 

• Specifying which device drivers are required in the system. 

• Specifying the hardware characteristics of the devices. 

• Associating a logical name to be used by the remainder of 
the OS and/or the user programs with a specific hardware 
device. 

Domain/OS traditionally runs on a limited set of hardware configu- 
rations and does not provide any means for on-site tailoring of the 
kernel. All devices needed by the system that could be present on 
any hardware platform must be bound into the operating system 
executable. 

While this does restrict the system somewhat, it also has these bene- 
fits: 

• There is no need for a local administrator to perform on- 
site configuration. 

• In a local area network, nodes can be booted from any 
other node holding the appropriate operating system exe- 
cutable without any problems arising from incompatible 
configurations. Standard naming conventions are used to 
locate the actual OS file. 

Internally, however, the software is organized around the above 
configuration requirements. The device descriptors used during sys- 
tem initialization effectively provide a central description of the 
hardware configuration. Each descriptor represents a device; the 
descriptor contents are used to specify the hardware characteristics
and the device name to be used to address the device. 



Summary 



Most new devices fit under one of the existing Domain/OS device 
classes, and hence can take advantage of the generic device support 
provided by the class modules. 

Disk Class to Driver Interface

The disk EPV contains the following entry points: 



disk_open_proc          Open a disk device and return drive parameters
                        to the caller

disk_close_proc         Close a disk device

disk_spindown_proc      Spin down active disks (called only from
                        os_$shutdown)

disk_revalidate_proc    Check removable media changes

disk_do_io_proc         Service data transfer requests; multiple blocks
                        may be requested with a single call to this
                        routine; on return it indicates whether all
                        transfers have been completed

disk_error_que_proc     Check error status of non-blocked queued
                        requests

disk_get_stats_proc     Return disk controller statistics (number of
                        reads/writes, etc.)

Network Class to Driver Interface

The network EPV contains the following entry points:



net_start_proc        Start network service

net_stop_proc         Stop network service

net_load_proc         Load firmware

net_send_proc         Transmit a packet

net_get_stats_proc    Return network device statistics

net_ioctl_mt          Control network sends/receives

net_p2_cleanup        Cleanup (process termination)

net_open_svc          User-level open routine

net_close_svc         User-level close routine

net_send_svc          User-level send packet routine

net_rcv_svc           User-level receive packet routine

net_ioctl_svc         User-level control routine

Bus Module Interface



bus_init_proc              Initialize bus module and data structures

bus_alloc_iova             Allocate an address in the bus address space

bus_xlate_iova             Translate bus address into physical page
                           number

bus_define_int             Associate an interrupt handler with an
                           interrupt vector

bus_enable_device          Enable device interrupts for a device

bus_disable_device         Disable device interrupts for a device

bus_device_interrupting    Check whether device has an interrupt pending

Extending the UNIX Protection Model
with Access Control Lists 

by 
Gary Fernandez, Larry Allen 



Abstract 



The UNIX operating system uses a simple and straightforward 
model for the protection of objects in the file system, granting ac- 
cess rights based on the ownership of the object. This simple model, 
however, may not be flexible enough for large user communities, or 
for communities with complex requirements for controlled data 
sharing. This paper describes an Access Control List extension to 
the UNIX protection model, which preserves the behavior of the 
existing UNIX programming interface while greatly increasing the 
flexibility of the protection system. 



Introduction 



This paper describes our efforts to extend the UNIX protection 
model by adding Access Control Lists (ACLs) . We first provide a 
summary of the UNIX protection model, then describe our ex- 
tended protection system. We describe how we integrated the ex- 
tended protection system with the UNIX protection system and dis- 
cuss some of the details involved in implementing the extended pro- 
tection system. Examples demonstrate the extended protection sys- 
tem, and finally, we summarize and describe a few lessons we 
learned. 



Copyright © 1988 Apollo Computer, Inc. Unpublished, all rights 
reserved. 

Overview of the UNIX Protection Model

This section first describes the UNIX Protection Model, and then 
discusses why extensions are appropriate. 

The UNIX Protection Model 

The concepts of owner ids and group ids are the basis of the UNIX 
protection model [7]. Each file and every process has an associated 
owner id and group id. With BSD4.3 a process may have multiple 
groups. Processes have two sets of ids: effective ids are used for 
rights checking; real ids keep track of the true ids for a process for 
which the effective ids have been altered temporarily. 

The owner and group ids for a process are inherited from the par- 
ent process. When a file is executed the effective ids for the proc- 
ess may be taken from the executable file, depending on attributes 
attached to the file: the setuid and setgid bits. System calls may 
change a process's owner and group ids, but these operations are 
highly restricted. 

The owner and group ids for a file are inherited from the ids of the 
process that creates the file. BSD4.3 takes the group id in file crea- 
tion from the parent directory, not from the process. A file's owner 
or the super-user may change a file's owner and group ids by using 
system calls. BSD4.3 restricts these changes to the super-user. 

Each file has protection information for three categories of users: 

• Owner — processes whose owner id matches the owner id 
of the file 

• Group — processes that are not in the owner category and 
whose group id(s) match the group id of the file 

• Other — processes that are not in the first two groups 

Protection is checked in the order owner, group, and other; the 
first matching protection applies. 

Protection permissions are read, write, and execute for files, and 
read, write, and search for directories. 
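The first-match rule can be made concrete with a short C sketch. The structures and function names are hypothetical, the bit values are chosen only for illustration, and the super-user is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative permission bits. */
enum { P_READ = 4, P_WRITE = 2, P_EXEC = 1 };

struct file_prot {
    int      owner_id, group_id;
    unsigned owner_bits, group_bits, other_bits;   /* each an OR of P_* */
};

struct process_ids {
    int        effective_owner;
    const int *groups;          /* BSD4.3 style: possibly several groups */
    size_t     group_count;
};

/* Return the permission bits of the first matching category:
   owner, then group, then other. Later categories are not consulted. */
static unsigned applicable_bits(const struct file_prot *f,
                                const struct process_ids *p)
{
    size_t i;

    if (p->effective_owner == f->owner_id)
        return f->owner_bits;
    for (i = 0; i < p->group_count; i++)
        if (p->groups[i] == f->group_id)
            return f->group_bits;
    return f->other_bits;
}

static int access_allowed(const struct file_prot *f,
                          const struct process_ids *p, unsigned wanted)
{
    return (applicable_bits(f, p) & wanted) == wanted;
}
```

One consequence of the first-match rule is that an owner with fewer rights than the group is limited to the owner rights, even when the owner is also a member of that group.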

Only the owner of a file or the super-user may change the permis-
sions associated with a file. 



Discussion of UNIX Protection Model 

The UNIX protection model works well when the set of persons 
accessing a file is a single person or a single group. However, once 
it is necessary to allow special access to more than one person or 
more than one group, the UNIX system is unable to describe this 
protection. Specially created groups partially solve this, but prove 
difficult to administer. 

Traditional protection alternatives are capabilities and access con- 
trol lists. Capabilities operate by passing a ticket allowing access to 
an object [4]. Restrictions may be placed on the number of copies 
of a capability, whether it may be passed on to other processes, 
access rights available, etc. Access control lists provide a list of 
(person, rights) pairs. By comparing entries in the list with the sub- 
ject requesting access, a matching entry is located. The matching 
entry then determines the applicable rights. 

Access Control Lists are not new. Multics was an early system using 
Access Control Lists as a basis for its protection system [6]. Aegis, 
Apollo's proprietary operating system, also used a protection system 
based on Access Control Lists [3, 5]. Aegis is the predecessor 
operating system on which our extended protection system is built. 



Overview of Extended Protection System 

This section describes the extended protection system. It provides 
an overview, describes how objects are protected, and explains how 
protections are assigned. 

Extended Protection System Basics

The extended protection system provides several additions to the 
basic UNIX protection: 

The number of recognized organizational divisions is extended from 
two (person and group) to three (person, group, and organization) . 
This more closely matches the real world divisions that occur in 
large organizations and large networks. The concept of organiza- 
tion allows a more hierarchical partitioning of the users of the sys- 
tem. The network user registry could be partitioned based on 
organizations, for example. Each process has, in addition to real
and effective person and group ids, real and effective organization
ids associated with it. Each file system object also has associated 
with it an owning organization. 

New protection rights are defined which allow persons other than 
the object owner to change the protection of an object and which 
prevent an object from being accidentally deleted. 

Additional protection entries are added that describe protection for
persons other than the owner, the owning group, and "other."
These additional entries may be used to provide more
granularity in the granting or denial of access.

Each directory contains "initial" protection information which is 
used to determine the initial protections which are applied to ob- 
jects which are created in the directory. 



Protection of an Object 

Each object has associated with it an owner, an owning group, and 
an owning organization. Rights may be associated with each of the 
owning fields or with "world," which is used for processes not 
matching any other protection entries. Because this information is 
always present and is used as entries in rights checking, we refer to 
this information as required entries. Optionally, an object may have 
associated with it extended entries, which are Access Control List 
entries. ACL entries are pairs of the form (subject identifier, 
rights). A subject identifier (SID) consists of three fields: the per- 
son, the group, and the organization. SID entries in extended en- 
tries may have each portion of the SID wildcarded: the character 
% is a wildcard with the meaning "match anything in this field."
An example of an extended ACL entry might be: 



    john_doe.%.r_d    rwx

This means that john_doe has read, write, and execute rights if he 
is a member of any group in the r_d organization. Extended en- 
tries provide finer control over access to objects. 
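
The wildcard rule can be illustrated with a minimal sketch. The
names SID, parse_sid, and entry_matches are hypothetical and are
not taken from the Domain/OS sources; the sketch only assumes that
a SID is written person.group.organization and that % matches any
value in its field.

```python
from typing import NamedTuple

WILDCARD = "%"  # the text's "match anything in this field" character


class SID(NamedTuple):
    person: str
    group: str
    org: str


def parse_sid(text: str) -> SID:
    """Split 'person.group.org' into its three fields (hypothetical helper)."""
    person, group, org = text.split(".")
    return SID(person, group, org)


def entry_matches(pattern: SID, subject: SID) -> bool:
    """True when every non-wildcard pattern field equals the subject's field."""
    return all(p == WILDCARD or p == s for p, s in zip(pattern, subject))


entry = parse_sid("john_doe.%.r_d")
in_r_d = entry_matches(entry, parse_sid("john_doe.osdev.r_d"))
in_finance = entry_matches(entry, parse_sid("john_doe.osdev.finance"))
print(in_r_d, in_finance)
```

Here john_doe matches through any group as long as the organization
is r_d; the same person acting in another organization does not
match the entry.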

Each process has both a real and an effective subject identifier. 
The real SID corresponds to the original identification of the proc- 
ess, while the effective SID may be different as a result of setID 
programs. SetID programs work as in UNIX protection; an executa- 
ble object may be marked so that any process running the program 
will have its effective SID changed while the program is running. 
A setID program may change all three fields of the SID.

The extended model includes the standard UNIX protections of 
Read, Write, and Execute (Search). We have also added new 
rights: Protect, which determines whether an SID may change the 
protection on an object; and Keep, which prevents an object from 
being deleted or having its name changed. In addition, a required 
entry may be marked "ignored." This means that the owner, 
group, or organization information is present, but that rights check- 
ing will not use this information. 

Rights checking is similar to rights checking in the UNIX operating 
system. Rights checking proceeds as an ordered walk through the 
protections specified in the required entries and the extended en- 
tries. Given an effective SID, rights are examined as follows: 

• If the effective person matches the owner, the owner rights 
are returned. 

• If the effective person matches any extended entries of the
form person.X.X, where X means "don't care" (these are
person entries), the rights from the entry are returned.

• If the effective group matches the owning group, the group 
rights are returned. 

• If the effective group matches any extended entries of the
form %.group.X (group entries), the rights from the entry
are returned.

• If the effective organization matches the owning organization,
the organization rights are returned.

• If the effective organization matches any extended entries 
of the form %.%.org (organization entries), the rights 
from the entry are returned. 

• Otherwise, the rights available to "world" are returned.

Note that the algorithm always checks the specific required entry 
before extended entries of the same type. Also note that there is 
no requirement at any point that extended entries of a particular 
type be present; rights checking proceeds to its next phase if none 
are present. A summary rights mask limits rights available in ex- 
tended entries. This mask is described in more detail later. 
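
The ordered walk can be sketched as follows. This is an
illustration under stated assumptions, not the Domain/OS
implementation: all type and function names are hypothetical, only
Read, Write, and Execute are modeled (Protect and Keep are
omitted), and the sketch assumes that an extended entry must match
the whole effective SID, with % as the wildcard.

```python
from dataclasses import dataclass
from enum import IntFlag
from typing import List, NamedTuple, Tuple


class Rights(IntFlag):
    NONE = 0
    X = 1
    W = 2
    R = 4


class SID(NamedTuple):
    person: str
    group: str
    org: str


@dataclass
class Required:
    ident: str
    rights: Rights
    ignored: bool = False   # an "ignored" entry is skipped by rights checking


@dataclass
class Protection:
    owner: Required
    group: Required
    org: Required
    world: Rights
    extended: List[Tuple[SID, Rights]]
    summary: Rights         # mask applied to rights taken from extended entries


def matches(pattern: SID, sid: SID) -> bool:
    return all(p == "%" or p == s for p, s in zip(pattern, sid))


def check_rights(prot: Protection, sid: SID) -> Rights:
    # One phase per SID field: the required entry is tried first, then the
    # extended entries of the same type (person, group, organization).
    phases = [
        (prot.owner, sid.person, lambda e: e.person != "%"),
        (prot.group, sid.group, lambda e: e.person == "%" and e.group != "%"),
        (prot.org, sid.org,
         lambda e: e.person == "%" and e.group == "%" and e.org != "%"),
    ]
    for required, subject_field, same_type in phases:
        if not required.ignored and required.ident == subject_field:
            return required.rights
        for pattern, rights in prot.extended:
            if same_type(pattern) and matches(pattern, sid):
                return rights & prot.summary
    return prot.world


prot = Protection(
    owner=Required("frank", Rights.R | Rights.W),
    group=Required("acct_r", Rights.R),
    org=Required("finance", Rights.R),
    world=Rights.NONE,
    extended=[(SID("donna", "hdr", "finance"), Rights.R | Rights.W)],
    summary=Rights.R | Rights.W,
)
```

With this protection, donna.hdr.finance is resolved by the person
entry before the organization entry is ever consulted, which is why
she can receive Write rights that other members of finance do not.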



Assigning Object Protection 

Protection information is normally applied to an object when it is 
created. Inheritance from the directory in which the object is cre- 
ated determines the necessary protection information. Each direc- 
tory has two sets of protection information: initial file protection, 
which is used when a file is created in the directory, and initial 
directory protection, which is used when a sub-directory is created 
in the directory. These initial protections are in addition to the 
standard protections associated with all objects. 

The extended protection system provides several different styles of 
initial protection (Aegis, BSD4.3, System V.3) to allow different 
classes of users to have their favorite object protection. Aegis users 
typically give specific protection information to be applied to all 
objects created in a directory. System V.3 users expect objects 
created in a directory to be given protection information based on 
the effective SID of the process creating the object. BSD4.3 users 
expect the SID of the creating process and the containing directory 
to determine the protection for objects created in the directory. 
These alternatives are all provided by initial protections associated 
with directories. 

File creation can be viewed as taking place in two steps. First, the 
object is created using default information based on the creating 
process. The default information may then be overridden by initial 
file or directory protection information. 

Two pseudo rights for initial protections allow appropriate UNIX
behavior: "inherit SID information from process" and "use rights
specified by the process, masked by umask." The first pseudo right
means the SID information (person, group, or organization) is not
to be overridden by information in the directory. The second
pseudo right means the rights specified by the process (and
modified by "umask") are not to be overridden by information in the
directory. These pseudo rights allow UNIX behavior while allowing
Aegis-style inheritance to continue to override the process
information. It is necessary to distinguish between BSD4.3 and
System V.3 semantics because BSD4.3 specifies that the owning group
is inherited from the containing directory, whereas in System V.3
the owning group is inherited from the creating process.



Integrating ACLs with UNIX Protections 

This section describes how the extended protection system extends 
the standard UNIX protection model, paying particular attention to 
the behavior of the standard protection-related UNIX system calls 
when extended protections are used. We first describe our goals. We
also present the motivation for the initial protection mechanism, 
how querying and modifying protections works, architectural princi- 
ples, and the integrated protection model. 



Goals 

In designing the extended protection system, our primary goal was 
to make it possible to use unmodified UNIX programs in a system 
where administrators or users have chosen to use extended protec- 
tions, and get "reasonable" results. Two sets of UNIX system calls 
deal with protection: 

• File creation calls (open(), creat(), mknod(), mkdir()),
which specify initial protection modes and ownership
information for newly created files.

• Calls for querying or changing file attributes (stat(),
chmod(), chown()), which allow the client to read or
modify the protection-related attributes of files.

For file creation, we wanted to enable system administrators or or- 
dinary users to conveniently use the extended protection system 
without modifying any UNIX programs; this implies the ability to
specify initial protections to be applied to a file external to the pro- 
gram creating the file. We wanted to design a "mechanism" for 
specifying initial file protections that could support a variety of 
"policies" for use of the protection system as a whole; in particular, 
because Apollo's UNIX product is a dual port of Berkeley and 
AT&T variations, it was important to support both the BSD4.3 and 
System V.3 policies for initial file protection. 

For the system calls that query and modify file protections, it was 
again important to maintain reasonable behavior for unmodified 
UNIX programs in the presence of extended protections. Because 
standard UNIX system calls only deal with the rights of the owner, 
group, and "others," a secondary goal, time permitting, was to add 
a new programming interface to the extended protection system. 
We felt that the protection policy would be set by users or system 
administrators, or would be specified during installation of software 
subsystems, and hence it would be uncommon for programs to ex- 
plicitly create or apply extended protections. 



Initial File Protection Mechanism 

The UNIX protection model, especially in the Berkeley variations, 
has the beginnings of separation of the protection policy from the 
protection mechanism. In the Berkeley UNIX system, three inde- 
pendent specifiers control initial file protections: 

• The protection mode supplied in the open() call, from the
creating program.

• The setting of the per-process umask, from the user run- 
ning the program. 

• Group ownership, from the directory in which the file is 
being created. 

Each of these specifiers is a potential way of specifying extended 
initial protections; we concluded that the best approach was to ex- 
tend the notion of inheritance from the directory in which the file is 
being created. Because we could not require existing programs
to be modified, we could not change the open() call. We
considered extending the "umask" notion to include the ability to
specify extended protection information, but this had the wrong
level of granularity for specifying a protection policy. There is no
reason to assume that all files created by a given process should be
protected the same way (consider a compiler creating temporary
files for intermediate results and a binary file for the final output).
Specifying the protection policy on a per-directory basis was
intuitively appealing; it matched common uses of the file system (for
example, all source files in a source tree would have common
protections); and it had worked well in practice in the Aegis operating
system.

Beginning with this idea, we evolved the model for specifying initial 
protections. We now describe the algorithm for file creation and 
setting the initial protection in more detail: 

• Start with the owner, group, and organization of the creating
process, and the protection mode value passed in to
the open() call.

• Modify the protection mode value by masking with the 
process's umask. 

• For each required entry (identifiers and protection mode
values), if the initial protection specification in the directory
specifies an explicit value, override the value supplied
by the process for that entry. This operation is called
"merging the initial protections."

• If the initial protection specification specifies extended in- 
formation, apply the extended information to the file. 

• When creating a sub-directory, the initial protection speci- 
fications for the sub-directory are inherited from its parent 
directory. 

The result is a flexible mechanism for specifying the initial protec- 
tions to be applied to newly created files. The examples section 
presents sample protection policies that use this mechanism. 
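
The first three steps of the algorithm can be sketched as below.
This is a simplified illustration, not the actual implementation:
the function and constant names are hypothetical, only the rwx bits
are modeled (Protect and Keep are left out), the default of
"[ignored]" for organization rights is an assumption of the sketch,
and extended information and sub-directory inheritance are omitted.

```python
FROM_PROCESS = "[from process]"          # pseudo right: keep the process's SID field
BY_PROCESS = "[specified by process]"    # pseudo right: keep mode masked by umask
IGNORED = "[ignored]"


def rwx(bits: int) -> str:
    """Render the low three bits of a mode as an rwx string."""
    return "".join(c if bits & m else "-" for c, m in (("r", 4), ("w", 2), ("x", 1)))


def create_protection(sid, mode, umask, init_ids, init_rights):
    """Return (identifiers, rights) for a newly created file."""
    person, group, org = sid
    # Steps 1 and 2: start from the creating process, then apply the umask.
    masked = mode & ~umask & 0o777
    ids = {"owner": person, "group": group, "org": org}
    rights = {
        "owner": rwx((masked >> 6) & 7),
        "group": rwx((masked >> 3) & 7),
        "org": IGNORED,                  # assumption: UNIX modes carry no org rights
        "world": rwx(masked & 7),
    }
    # Step 3: merge the initial protections. Explicit directory values
    # override the process's values; pseudo rights leave them alone.
    for entry, value in init_ids.items():
        if value != FROM_PROCESS:
            ids[entry] = value
    for entry, value in init_rights.items():
        if value != BY_PROCESS:
            rights[entry] = value
    return ids, rights


# A BSD4.3-style directory: the owning group comes from the directory.
bsd_ids = {"owner": FROM_PROCESS, "group": "osdev", "org": FROM_PROCESS}
bsd_rights = {"owner": BY_PROCESS, "group": BY_PROCESS,
              "org": IGNORED, "world": BY_PROCESS}

ids, rights = create_protection(("mark", "testing", "r_d"), 0o775, 0o011,
                                bsd_ids, bsd_rights)
print(ids, rights)
```

Replacing "osdev" with the [from process] pseudo right in the group
slot gives the System V.3 behavior, in which the owning group is
taken from the creating process instead of the directory.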



Inquiring and Modifying Protections 

The most difficult task in extending the UNIX protection model 
with ACLs arose in ensuring that the existing UNIX system calls for 
inquiring and modifying file protections (stat(), chmod(),
chown()) continued to have reasonable and consistent behavior,
even in the presence of extended protection information. We
realized early in the design that there would be circumstances in which
the UNIX calls could not provide an accurate representation of the 
extended protection information; when using the UNIX calls on 
files with extended protection, there would always be a possibility of 
unexpected behavior. 

Consider the case of the stat() call applied to a file with extended 
protection. Stat() returns (among other things) file protection
information in three categories: owner rights, group rights, and others
rights. Clients of the stat() call use this information both to provide
information to users (for example, in listing directories) and to
make decisions about program behavior (the shell will not execute a
script to which the user does not have execute rights). When
applied to a file protected with an extended protection, the stat() call
could do one of two things:

• It could "lie" about the accessibility of the file. It could 
return the rights for owner, group, and world and ignore 
the presence of the extended entries. Or, it could com- 
bine the rights granted by extended entries into the "oth- 
ers" rights. In either case, clients of stat() will be con- 
fused: either the client will be unable to access a file to 
which stat() claims it has access, or the client will have
access rights not represented in the output of stat().

• It could construct a protection mode accurately represent- 
ing the rights of its client. This is attractive, as the client 
will never see inconsistencies between the values returned 
by stat() and the behavior of other UNIX system calls, but 
has two serious disadvantages. First, stat() would return 
different values for different clients; two users listing the 
same directory might see different protection modes on 
files. (Previous versions of our system used a similar 
scheme, which led to confusion among users.) Second, 
we felt that the performance degradation caused by com- 
puting the protection mode returned by stat() on a client- 
by-client basis would be prohibitive. Stat() is a frequently 
used call in UNIX systems; it is the only way to obtain file 
attributes, and always returns full information. Expensive 
protection mode computation affects all callers. 

Similarly, what happens when chmod() modifies protections on a 
file with extended protection? The handling of the owner and 
group rights is straightforward, but it's not clear what the "others" 
rights supplied to chmod() mean. The caller could intend for any 
extended entries to be disabled, and the "others" rights to be 
granted to everyone except the file's owner and group.
Alternatively, the caller could be a naive user who knows nothing about
extended protection and is simply adding rights for the owner of the
file, using chmod() because it is the only UNIX system call he
knows.



Architectural Principles 

Where we could not completely meet our primary goal of "ex- 
pected" behavior from UNIX system calls in the presence of ex- 
tended protections, we felt there should be architectural principles 
from which the behavior can be derived. We identified the follow- 
ing principles as being important: 

• In the absence of extended protection information, the 
protection system must exactly implement the UNIX se- 
mantics. It must be easy to configure a system to use only 
UNIX protections; having done so, the extended protec- 
tion system should be invisible to the UNIX user who does 
not use extended UNIX commands. This principle was 
met by the system for specifying initial protections. 

• The behavior of the UNIX system calls should not be de- 
pendent on the identity of the user making the call. For 
example, stat() should return the same value for the pro- 
tection mode irrespective of the identity of the caller. 

• UNIX system calls should always err towards increased se- 
curity. For example, if any user has read access to a file, 
stat() should represent that fact, even if that means over- 
representing the rights of some users. 

• It must be possible, using UNIX system calls, to disable the 
effects of the extended protection system. Using chmod() 
to deny group and world access to a file should also disable 
rights granted through extended entries. This follows from 
the previous principle: err towards increased security. 

These principles match existing UNIX standards; the IEEE POSIX 
specification [8], for example, demands that chmod() disable ex- 
tended protection information. 

The Integrated Protection Model

The model for integrating the extended protection system into the 
UNIX protection model is derived from the architectural principles 
listed above. The protection information for the owner and group 
of a file, as returned by stat() and controlled by chmod(), is
precisely the protection information for the file's owner and group as
maintained by the extended protection system, while the protection 
information for "others" is a summary of all other protection infor- 
mation maintained by the extended protection system. This sum- 
mary is normally the logical OR of the organization rights, world 
rights, and all rights granted by extended entries; but may be modi- 
fied by chmod(), as described below. To improve performance, 
the inode includes this summary information; it serves a dual pur- 
pose: 

• The stat() call returns the summary information as the "others"
rights. For example, if any user other than owner and group has
rights to read a file, the "others" rights returned by stat() will
include read rights.

• When checking access rights, the system masks the extended
protection information in extended entries with the summary
information from the inode. Chmod() sets the summary information to
the "others" rights supplied in the chmod() call, permitting chmod()
to disable extended entries if desired.

Consider a file with the following protections: 



    Owner:   frank      prw-
    Group:   acct_r     -r--
    Org:     finance    -r--
    World:              ----
    Extended Entries:
        donna.hdr.finance   -rw-

The protection mode returned by stat() will be rw-r--rw-. If the
owner of the file changes the file's protection mode to rw-r--r--,
this changes the summary information to r--. By the masking
operation described above, this will effectively remove the Write rights
granted to donna.hdr.finance by the extended entry.
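
The arithmetic behind this example can be checked with a short
sketch. The helper names are hypothetical, and rights are modeled
as three-bit rwx values only; the Protect right held by the owner
does not appear in a UNIX protection mode and is left out.

```python
R, W, X = 4, 2, 1


def rwx(bits: int) -> str:
    """Render a three-bit rights value as an rwx string."""
    return "".join(c if bits & m else "-" for c, m in (("r", R), ("w", W), ("x", X)))


def summary(org: int, world: int, extended: list) -> int:
    """Logical OR of organization rights, world rights, and all extended entries."""
    result = org | world
    for rights in extended:
        result |= rights
    return result


def stat_mode(owner: int, group: int, summ: int) -> str:
    """Protection mode as stat() would report it: owner, group, then summary."""
    return rwx(owner) + rwx(group) + rwx(summ)


# frank rw-, acct_r r--, finance r--, world none, donna.hdr.finance rw-
summ = summary(org=R, world=0, extended=[R | W])
before = stat_mode(R | W, R, summ)

# chmod to rw-r--r-- replaces the summary with the supplied "others" rights.
summ_after = R
donna_after = (R | W) & summ_after       # extended entry masked by the summary
print(before, rwx(donna_after))
```

The first value is the mode seen before the chmod(); the second is
what the extended entry for donna.hdr.finance yields once the new
summary is applied as a mask.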






Note how this protection model meets the goals described above: 

• Stat() is fast because it never has to examine the extended 
entries. It returns the same protection rights irrespective of 
the identity of its caller. 

• Stat() errs in the direction of increased security: if any
user has Write rights to a file, this fact will be reflected in
the "others" information returned by stat().

• Chmod() can completely disable the extended entries. 

We have also added a set of utilities that allow UNIX users to list, 
copy, and edit ACLs. 



Implementation Description 

This section provides the reader with a description of some of the 
more interesting aspects of the implementation of the extended pro- 
tection system. 



Storing Protection Information 

Protection information is abstracted into two pieces: required infor- 
mation and extended information. The object's inode contains the 
required information. Extended information, if present, is stored in 
a separate object called an ACL object that is pointed to by the 
object's inode. ACL objects are immutable and read-only; once 
they are created they may not be changed. Changing the protection 
associated with an object is accomplished by creating a new ACL 
object. ACL objects are shareable; one ACL object may specify 
the extended protection information for many data objects. ACL 
objects have reference counts; when the last reference to an ACL 
object is removed, the ACL object is deleted. 

Protection information for initial file and initial directory ACLs is 
implemented in a similar manner. A header in the directory con- 
tains the initial file and initial directory required information. Sepa- 
rate ACL objects contain extended information, if present. When 
applying an extended ACL to a newly created file, the reference 
count on the ACL object is merely incremented, reducing file crea- 
tion costs. 

A file system scanner allows ACL objects with identical protection
information to be compacted, reducing disk space required for 
ACL objects. 
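
The sharing scheme can be pictured with a small in-memory model.
This is only a sketch: the class and method names are hypothetical,
and the real ACL objects live on disk and are referenced from
inodes and directory headers.

```python
class AclStore:
    """Immutable, shareable, reference-counted ACL objects (illustrative only)."""

    def __init__(self):
        self._objects = {}      # uid -> [frozen entries, reference count]
        self._next_uid = 1

    def create(self, entries) -> int:
        """Create a new ACL object; the caller holds the first reference."""
        uid = self._next_uid
        self._next_uid += 1
        self._objects[uid] = [tuple(entries), 1]   # tuple: contents never change
        return uid

    def add_ref(self, uid: int) -> int:
        """Share an existing ACL object with one more data object."""
        self._objects[uid][1] += 1
        return uid

    def release(self, uid: int) -> None:
        """Drop one reference; the ACL object is deleted with the last one."""
        obj = self._objects[uid]
        obj[1] -= 1
        if obj[1] == 0:
            del self._objects[uid]

    def entries(self, uid: int):
        return self._objects[uid][0]

    def __contains__(self, uid: int) -> bool:
        return uid in self._objects


store = AclStore()
dir_initial_acl = store.create([("donna.hdr.finance", "-rw-")])
# Creating a file in the directory only bumps the reference count.
file_acl = store.add_ref(dir_initial_acl)
# Changing the file's protection creates a new object and releases the old one.
new_acl = store.create([("donna.hdr.finance", "-r--")])
store.release(file_acl)
file_acl = new_acl
```

Because an ACL object never changes after creation, any number of
files can point at it safely; a protection change on one file
cannot affect the others that still share the old object.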



Caching of Protection Information 

Several levels of caching improve the performance of the protection 
system. The first cache is the inode cache. Whenever an object's 
attributes are examined, they are placed into an operating system 
cache. The primary purpose of this cache is to maintain location 
and object information. However, since the inode includes the re- 
quired portion of the protection information, this information is 
cached as well. 

When a file is opened, the protection rights associated with the
opening process are maintained in an open file table. This
information is available for later rights checking. With this cache,
rights checking can immediately provide rights information without
sending network messages to the owning node to re-determine the
rights available.

The contents of ACL objects are also cached. Each extended ACL 
object is identified by a unique id (UID) [2]. This cache provides
UID to extended ACL information mapping. Whenever it is neces- 
sary to consult the information stored in an ACL object, this cache 
is consulted. If the cache does not contain the ACL object, a 
cache replacement algorithm selects an entry to replace; the ACL 
object is then read into the cache entry. Because users tend to 
protect objects in a few unique ways, there is a high probability of 
locating the ACL information in the cache. 
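
A UID-keyed cache of this kind might look like the sketch below.
The text does not say which replacement algorithm is used; least
recently used is assumed here purely for illustration, and the
class name and the read_acl_object callback are hypothetical.

```python
from collections import OrderedDict


class AclCache:
    """Maps ACL object UIDs to their contents, with LRU replacement (assumed)."""

    def __init__(self, capacity: int, read_acl_object):
        self._capacity = capacity
        self._read = read_acl_object        # called on a miss to fetch the object
        self._entries = OrderedDict()       # uid -> ACL information

    def lookup(self, uid):
        if uid in self._entries:
            self._entries.move_to_end(uid)  # mark as most recently used
            return self._entries[uid]
        if len(self._entries) >= self._capacity:
            self._entries.popitem(last=False)   # evict the least recently used
        info = self._read(uid)
        self._entries[uid] = info
        return info


reads = []


def read_from_disk(uid):
    """Stand-in for reading an ACL object; records each read for inspection."""
    reads.append(uid)
    return f"acl-{uid}"


cache = AclCache(capacity=2, read_acl_object=read_from_disk)
for uid in (1, 2, 1, 3, 1, 2):
    cache.lookup(uid)
print(reads)
```

Since many objects share a few ACL objects, even a small cache of
this shape would be expected to satisfy most lookups without
reading the ACL object again.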

Finally, there is a higher level cache used when copying objects 
from volume to volume. Each ACL object must reside on the same 
volume as the object it protects. As a result, if an object with an 
associated ACL object is copied from one volume to another in a 
mode preserving protection information, it is necessary to create a 
new ACL object on the destination volume. To minimize the number
of new ACL objects created, a cache keeps track of source ACL
object to destination ACL object mappings. This means that if a
complete tree is copied from one volume to another, objects shar- 
ing ACL objects on the source will share ACL objects on the desti- 
nation. 
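
The mapping cache can be sketched in a few lines. The function
name and the create_acl_on_destination callback are hypothetical;
the point is only that each source ACL object is recreated once per
copy operation, however many objects share it.

```python
def copy_tree(objects, create_acl_on_destination):
    """Copy (name, source ACL uid or None) pairs, preserving ACL sharing."""
    mapping = {}                 # source ACL uid -> destination ACL uid
    copied = []
    for name, src_acl in objects:
        if src_acl is None:
            dst_acl = None       # object has no extended protection
        else:
            if src_acl not in mapping:
                mapping[src_acl] = create_acl_on_destination(src_acl)
            dst_acl = mapping[src_acl]
        copied.append((name, dst_acl))
    return copied


created = []


def make_destination_acl(src_acl):
    """Stand-in for creating an ACL object on the destination volume."""
    created.append(src_acl)
    return 1000 + src_acl        # arbitrary new uid for the illustration


tree = [("a.c", 7), ("b.c", 7), ("notes", None), ("Makefile", 9)]
result = copy_tree(tree, make_destination_acl)
print(result, created)
```

In this run a.c and b.c share one ACL object on the source, so only
two ACL objects are created on the destination and the two files
share the same new object there.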

Using a Copy Semantic

For copying protection information from one location to another, 
we decided to use a copy semantic. What we mean by a copy se- 
mantic is the following: when a user program wishes to copy protec- 
tion information from one object to another, the program identifies 
the source, destination, and source and destination types and calls 
a protection copy procedure. Directories require the type informa- 
tion, as they have three sets of protection information (i.e., the 
directory protection information, the initial file information, and 
the initial directory information). The copy semantic seems to be a
natural way to handle protection information. In addition, it has 
the advantages of making protection copying easy and of insulating 
programs from the low-level protection information. This contrasts 
with another frequently used method, where a program gets the old 
information and then applies it to a new object; this method re- 
quires the program to allocate temporary data structures of correct 
types to hold the protection information. Different calls may also 
be necessary for different object types. We expect that a system 
call interface to ACLs would include a call based on the copy se- 
mantic. 
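
A call based on the copy semantic might take the following shape.
This is a hypothetical interface written for illustration; the
names ProtType, Node, and copy_protection are not from the Domain/OS
programming interface, and protection values are treated as opaque.

```python
from enum import Enum


class ProtType(Enum):
    OBJECT = "object protection"
    INITIAL_FILE = "initial file protection"
    INITIAL_DIR = "initial directory protection"


class Node:
    """A file system object; directories carry three sets of protection."""

    def __init__(self, is_dir: bool = False):
        self.prot = {ProtType.OBJECT: None}
        if is_dir:
            self.prot[ProtType.INITIAL_FILE] = None
            self.prot[ProtType.INITIAL_DIR] = None


def copy_protection(src: Node, src_type: ProtType,
                    dst: Node, dst_type: ProtType) -> None:
    """Copy one set of protection information without exposing its layout."""
    if src_type not in src.prot or dst_type not in dst.prot:
        raise ValueError("object has no protection information of that type")
    dst.prot[dst_type] = src.prot[src_type]


directory = Node(is_dir=True)
directory.prot[ProtType.INITIAL_FILE] = "opaque-protection"
new_file = Node()
# Apply the directory's initial file protection to the file itself.
copy_protection(directory, ProtType.INITIAL_FILE, new_file, ProtType.OBJECT)
```

The caller names only the two objects and the two types; it never
allocates a structure to hold the protection information, which is
the insulation the copy semantic is meant to provide.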



Using the Protection System 

This section provides several examples of how we use the extended 
protection mechanism to implement various protection policies.



Using BSD4.3 Style Protection Policy 

A person using strictly BSD4.3 style protection information would 
not use extended entries at all. Required information would specify 
the protection information. An example of the protection infor- 
mation for a file might be: 



    Owner:   frank    prwx
    Group:   osdev    -rwx
    Org:     r_d      [ignored]
    World:            -r-x



Notice that the owner has Protect rights, allowing him to change the 
protection information for the file. 

Directories would have protection information set so that the
BSD4.3 protection policy would be applied to files and directories 
created within the directory. An example of initial file information: 



    Owner:   [from process]   [specified by process]
    Group:   osdev            [specified by process]
    Org:     [from process]   [ignored]
    World:                    [specified by process]



By the notation [from process] we mean that the corresponding
field of the creating process's SID is substituted. By the notation
[specified by process] we mean that the rights are generated by
taking the rights supplied by the process to open() or creat(), as
modified by the process's umask. If a process has an effective SID
of mark.testing.r_d, specifies rights of 775 to open(), and has a
umask of 011, the resulting protection information would be:



    Owner:   mark     prwx
    Group:   osdev    -rw-
    Org:     r_d      [ignored]
    World:            -r--



Using System V.3 Style Protection Policy 

System V.3 style protection information is similar to BSD4.3 pro- 
tection information. The main exception is in the initial protection 
in a directory. An example of initial file protection information: 



    Owner:   [from process]   [specified by process]
    Group:   [from process]   [specified by process]
    Org:     [from process]   [ignored]
    World:                    [specified by process]

The notation is the same as in the previous example. 



Using UNIX Protection Plus Simple Extended Entry 

A UNIX user can take advantage of the extended protection system 
by adding extended entries. For example, if the owner of a file is 
frank.acct_r.finance and he wishes to allow donna.hdr.finance
the ability to write the file while denying others write access, he
would protect the file as follows: 



    Owner:   frank      prw-
    Group:   acct_r     -r--
    Org:     finance    -r--
    World:              ----
    Extended Entries:
        donna.hdr.finance   -rw-



Note that the world entry has been given no rights. 



Using More Sophisticated Protection 

This example shows how one might set up initial protections on a 
directory to enforce a policy that preserves information about an 
object's creator while maintaining explicit protections: 



    Owner:   [from process]   [ignored]
    Group:   [from process]   [ignored]
    Org:     [from process]   [ignored]
    World:                    -r--
    Extended Entries:
        frank.acct_r.finance  prw-
        john.acct_p           -rw-
        acct_r                -rw-
        finance               -r--



When files are created in the directory, the required entries will 
record the SID information of the creating process, while the ex- 
tended entries control the protection rights available. If 
mary.acct_r.finance creates a file in this directory, its protection
will be: 



    Owner:   mary       [ignored]
    Group:   acct_r     [ignored]
    Org:     finance    [ignored]
    World:              -r--
    Extended Entries:
        frank.acct_r.finance  prw-
        john.acct_p           -rw-
        acct_r                -rw-
        finance               -r--

Conclusions and Lessons Learned

Some of the decisions made during the design and implementation 
had wide-reaching effects in the operating system. For example, 
we decided to make UNIX ids (the small integers associated with 
UNIX user ids) part of the object attributes to speed up stat() 
performance. This decision implies that the system must be careful 
when manipulating SIDs so that the UNIX ids are always available 
for file creation. We wrote a file system scanner that takes the 
unique ids associated with SID information (i.e., the person, group 
or organization) and recalculates any UNIX ids that are incorrect. 
This would normally only be necessary for conflicting UNIX ids 
when merging networks. During implementation it was a useful tool 
for tracking and repairing incorrect id information. 

Reconciling the different origins of the new system was often diffi- 
cult. This was partially described in the discussion of integrating 
ACLs with UNIX protections. Often the Aegis way of performing 
an operation conflicted with the way that the UNIX system per- 
forms an operation. Sometimes it was possible to compromise and 
allow both ways to work (e.g., initial protections). In other cases it
was necessary to just do something the UNIX way (see the discussion
of chmod in "Integrating ACLs with UNIX Protections" in this
article). Often it was not a question of which alternative was right or
wrong, but merely that the two systems had chosen to do things in a 
different manner. 

Compatibility with previous operating systems is a feature that 
Apollo feels is important in a network operating system. Worksta- 
tions are encouraged to reference data residing on other worksta- 
tions. Compatibility adds a series of problems that would not occur 
in a system where ties between machines are weaker. For a major 
change to the operating system such as the extended protection 
system changes described in this paper, compatibility is a major 
concern. We would estimate that 50% of the new code associated 
with the extended protection system was compatibility code. 

Our protection system includes the concept of super-user. In a net- 
work of workstations, this concept causes problems in the admini- 
stration of workstations. It would be desirable to eliminate or mod- 
ify the way super-user is implemented (a possible alternative is de- 
scribed in [1]); we have left this as an issue in future work. 

We believe that we have shown how UNIX programs can view Access
Control Lists as an extension of UNIX protection and how ACLs allow
finer-grained control of access to objects in a file system, and that
we have provided useful information about our implementation.


A User Account Registration System
for a Large (Heterogeneous) UNIX Network 

by 
Joseph N. Pato, Elizabeth Martin, Betsy Davis 



Abstract 



Three problem areas arise when considering a user registration sys- 
tem for a large heterogeneous distributed computing environment. 
Large environments demand controls on the complexity of admini- 
stration. Heterogeneity requires an examination of the notion of 
identity in the network as well as the interoperability of software on 
different hosts. Distribution raises the problems of availability, reli- 
ability, and security. 



Generally available UNIX environments (BSD4.3 and AT&T 
SYS5.3) provide few tools for solving these problems. Account
administration is typically handled through manual editing of a sin- 
gle /etc/passwd file. Consistency is maintained on multiple ma- 
chines by periodically copying the /etc/passwd file to each machine 
in the network. For large networks with thousands of users and 
machines, these mechanisms are clumsy and error prone, and they 
vest too much power in a single system administrator. 



RGY is a replicated user registration system built on Apollo's port- 
able Network Computing System (NCS). The system consists of a 
set of daemons which maintain a replicated user registration data- 
base. Remote access to the user registration database is provided at 
each client site through remote procedure calls in a portable sub- 
routine library that replaces the getpwent(3) and getgrent(3) C 
library calls. Weakly consistent replication provides a high degree of 
availability and reliability. Updates are propagated individually,
yielding an inexpensive mechanism for maintaining consistency.
Updates are performed securely through authenticated interfaces,
allowing any client site to update the database.



Copyright © 1988 Apollo Computer, Inc. Unpublished, all rights 
reserved. 
Introduction



In a conventional single host UNIX environment, system account 
administration is managed through manipulation of the /etc/passwd
and /etc/group files. Generally a system administrator is responsi- 
ble for properly editing these files as well as performing the system 
house-cleaning associated with the arrival and departure of users. 
When UNIX was primarily an operating system associated with de- 
partmental minicomputer environments that consisted of few (un- 
der 100) users, the burden on the administrator was tolerable. 



In the mid-1980s, networks of inexpensive UNIX workstations became
common, providing a workstation owner with a high degree of
autonomy, as well as guaranteed response time in the absence of 
time-sharing. As long as each workstation or cluster of workstations 
remained autonomous, the administrative burden associated with 
each machine remained low. Account management could be dele- 
gated to members of the user community for each workstation or 
cluster. 



With this autonomy, however, early workstation users also encoun- 
tered isolation. Data and resource sharing became cumbersome, 
relying on bulk data transfer protocols like FTP and virtual terminal 
protocols like Telnet. To recover some of the cooperation found in 
time-sharing systems, computer vendors introduced distributed file 
systems like Apollo's Domain [6], Sun's NFS [10] and AT&T's 
RFS [9], and later developed network computing environments, 
like Apollo's Network Computing System (NCS) [4] and Sun's 
ONC [11]. Network computing environments provide heterogene- 
ous compute and resource sharing while distributed file systems pro- 
vide data sharing. Distributed file systems can be considered a sub- 
set of network computing environments. Therefore, for the pur- 
poses of this paper, we will use the term network computing envi- 
ronment to refer to either of these forms of network resource shar- 
ing. 



A network computing environment transforms a network of workstations
from independent administrative jurisdictions into a federation
of loosely coupled systems. For access control mechanisms to
be meaningful, every system must share a single representation for 
users' credentials (user names and user IDs). With independent 
workstations, the assignment of user names and user IDs needs to 
be unique only on each machine; in a network computing environ- 
ment, this assignment must be unique across the network. 



Since a user's credentials must be unique across the network, sys- 
tem administrators can no longer delegate account management re- 
sponsibilities to individual workstation user communities without 
compromising the security of other user communities in the feder- 
ated system. In contrast to isolated workstations, which can diffuse 
the administrative burden of account management, a network com- 
puting environment forces account management responsibilities to 
be assumed by a network administration authority. This network 
administrator accumulates the requirements of each workstation us- 
er community and then must redistribute the account information 
to each workstation in the federation. 



RGY, a replicated user registration system built on Apollo's NCS, 
has been developed to allow the administration of large network 
environments. Our goal is to provide a network user registration 
system that will work well in a network of tens of thousands of hosts 
and users. To accomplish this we have developed the replicated 
user registration database, RGY, to serve as the secure repository 
for network system account management information. Access to 
the RGY database is provided through NCS remote procedure call 
interfaces exported by a collection of daemon processes known as 
RGYDs. 



Hosts that wish to participate in the federated system access the 
RGY database through existing getpwent(3) and getgrent(3) C li- 
brary calls as well as through additional query and update primi- 
tives. Each host is the final authority in granting access to its re- 
sources. The RGY database allows the federated hosts to provide a 
consistent view of the user community, but each host is free to filter 
the information from the RGY database to restrict access, or to 
correct for differences in the local file system. 
This paper examines the following topics:

• Existing mechanisms for coping with network account ad- 
ministration 

• The data model and tools we developed to allow for divi- 
sion of labor in maintaining the user registration database 

• Providing highly available, reliable and efficient access to 
the RGY database 

• The mechanism to secure update access to the RGY sys- 
tem 



Existing Systems 



Password file maintenance is a real and present problem for admin- 
istrators of large UNIX installations. The 1987 USENIX Large In- 
stallation System Administrators Workshop drew numerous position 
papers on UNIX account management and the distribution of infor- 
mation across networks. At this workshop, attendees discussed the 
evolutionary processes that result in large installations and the need 
to unify account information in the resulting network [3]. Other 
attendees described the use of structured editors for adding entries 
to the password file and the subsequent semi-automatic copying of 
the file to all hosts in the network [7]. These current approaches, 
centered around the existing UNIX data and administrative models, 
are cumbersome in today's small networks (fewer than 100 hosts) 
and hold little promise for the large (thousands of hosts) networks 
of the future. 



Remote access to the login account database allows replication 
strategies to limit their focus to a strategic subset of the network. 
Sun's Yellow Pages (YP), part of ONC, is a simple network lookup 
service that has been used to provide remote access to password file 
information. While YP did not modify the /etc/passwd data 
model, it did introduce the use of remote procedure calls to re- 
motely access the UNIX login account database. Outside the UNIX 
environment, the Xerox Grapevine system [2] has addressed many 
of these issues. Grapevine was intended to be primarily used as a 
delivery mechanism for a large, dispersed computer mail system. It 
did not directly address the issues of maintaining a replicated user
login account database. It maintained a replicated database of mail 
users which, in some instances, could also be used for user logins. 
Its concern for the issues of authentication, access control, decen- 
tralized administration, replication and scalability has served as in- 
spiration for much of the work described in this paper. 



Administrative Model 



Large computing environments tend to contain a large number of 
both users and machines. Frequently these consumers and re- 
sources span administrative or organizational domains. To accom- 
modate this type of environment we have enhanced the UNIX 
identity model to include the notion of organization. In addition, 
we have added access control objects to each entry in the user reg- 
istration database. These changes allow mutually suspicious system 
administrators to cooperatively manage a logically partitioned user 
registration database. Administrative complexity is further reduced 
through the use of a structured editor for database manipulations. 



The RGY Data Model 



The RGY user registration system maintains a database consisting of 
naming information for people, groups and organizations, login ac- 
count information for people, and general system properties and 
policies. In the /etc/passwd file people and accounts are combined 
in a single record. We, however, feel that people and accounts are 
distinct objects in a user registration system: accounts represent ac- 
tive roles that people can play when accessing the system, whereas 
people maintain passive roles through the ownership of files, receipt 
of mail, etc. that persist independent of the existence of an ac- 
count. 



Groups and organizations are collections of people. Groups retain 
their conventional UNIX semantics and exist to allow a collection 
of people to share privileges to system objects. Organizations pro- 
vide another dimension for sharing. Apollo has extended the UNIX 
file protection model from user, group, others (rwxrwxrwx) access 
to user, group, organization, others access. Organizations can be
used just like groups, where a person can be a member of any num- 
ber of groups. More typically, however, we use organizations as a 
means of partitioning the global user community into administrative 
jurisdictions, where each person belongs to a single organization. 



In addition to maintaining information about the users and logical 
groupings of the networked system, the RGY database contains sys- 
tem policy information. Policy information consists of system con- 
figured minimum password length, password content restrictions, 
password expiration lifetime, absolute password expiration date, 
and account lifespans. Policy is never enforced by the user registra- 
tion system. It exists as a guide for clients of the system. 



Naming Information 



The naming database is divided into three relations, also referred to 
as naming domains, that establish the existence of individual per- 
sons, groups, and organizations within the registry. An entry in one 
of these naming domains is called a PGOitem. A PGOitem estab- 
lishes the binding between a name and a set of credentials which 
consist of a unique identifier (UID) and a unix id. The unix id, 
preserved for compatibility with password file entries, is a small in- 
teger value used as a user id for people, a group id for groups, and 
an org id for organizations. Aliases, multiple names mapping to the 
same credential information, are allowed to exist. 



PGOitems contain a fullname field, an owner field and miscellane- 
ous properties. A PGOitem can contain a list of typed mail data. 
This data consists of a type code and an uninterpreted printstring. 
The printstrings may be interpreted by a system mailer and usually 
define the preferred delivery mechanism or mailbox to be used. 



Group and organization PGOitems have associated membership
lists. A membership list enumerates the people that have the rights 
and privileges of the group or organization. Organization PGOitems 
may also contain policy information. By establishing policy, an organization
may impose stricter password and account discipline
than the other organizations in the registry. The actual policy data 
for an organization can be retrieved for editing, but most operations 
that yield policy information return the effective policy data. To 
determine an organization's effective policy, the system compares 
the organization policy information with the base registry policy and 
returns the most restrictive value for each field. 
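
The effective-policy rule above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
field names, and the convention that None means "no restriction set",
are hypothetical, since the paper does not give the record layout.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Policy:
    # Hypothetical field names; None means "no restriction set".
    min_password_length: Optional[int] = None      # larger is stricter
    password_lifetime_days: Optional[int] = None   # smaller is stricter
    account_lifespan_days: Optional[int] = None    # smaller is stricter


# For each field, the function that picks the stricter of two set values.
_STRICTER = {
    "min_password_length": max,
    "password_lifetime_days": min,
    "account_lifespan_days": min,
}


def effective_policy(org: Policy, registry: Policy) -> Policy:
    """Return the most restrictive value of each field."""
    merged = {}
    for f in fields(Policy):
        a, b = getattr(org, f.name), getattr(registry, f.name)
        if a is None or b is None:
            # Only one side (or neither) restricts this field.
            merged[f.name] = a if b is None else b
        else:
            merged[f.name] = _STRICTER[f.name](a, b)
    return Policy(**merged)
```

An organization that asks for a longer minimum password than the
registry gets its own value; one that asks for a longer password
lifetime than the registry allows gets the registry's value.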



Login Accounts 



Accounts contain a superset of the information stored in the /etc/ 
passwd file. An account entry is divided into two portions. The 
user portion of the account contains the home directory, login shell, 
password, and gecos fields. The administrative portion contains 
information about the creator of the account, account expiration 
date and other information to indicate the validity of the account. 
An account defines a subject identifier (SID). A SID is a UID 
triplet which identifies the person, group, and organization that cor- 
respond to the account. 



UIDs [5], which are used extensively throughout the Apollo system,
are a 64-bit concatenation of the current time and host network
address. Unlike the unix ids, which are assigned by the system
administrator, UIDs are generated for the PGOitem by the RGY
system and are guaranteed to be unique.



Accounts are keyed by login name, which is the concatenation of
the person name, the group name and the organization name separated
by periods (e.g., the user smith might have the account
smith.sys.r_d). Login names can be abbreviated; accounts define
the minimum abbreviation necessary for their selection. In the example
above, the account smith.sys.r_d could be accessed as
smith.sys if the associated abbreviation was person and group, or
as smith if the associated abbreviation was simply person. Each
person may have multiple accounts, either by using aliases, or by 
creating accounts with different abbreviations. 
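
A minimal sketch of how an abbreviated login name might select an
account follows. The Account record and the abbreviation labels
("person", "group", "full") are assumptions made for illustration;
the paper does not describe how RGY stores them.

```python
from typing import List, NamedTuple, Optional


class Account(NamedTuple):
    person: str
    group: str
    org: str
    abbrev: str  # hypothetical labels: "person", "group", or "full"


# Number of name parts a login must supply for each abbreviation level.
_PARTS_NEEDED = {"person": 1, "group": 2, "full": 3}


def find_account(login: str, accounts: List[Account]) -> Optional[Account]:
    """Return the account selected by a possibly abbreviated login name."""
    given = login.split(".")
    for acct in accounts:
        full = [acct.person, acct.group, acct.org]
        # The login must supply at least the account's minimum abbreviation
        # and must agree with the full name on every part it supplies.
        if (_PARTS_NEEDED[acct.abbrev] <= len(given) <= 3
                and given == full[:len(given)]):
            return acct
    return None
```

With an account smith.sys.r_d whose abbreviation is "person", the
names smith, smith.sys and smith.sys.r_d all select it; if the
abbreviation were "group", the bare name smith would not.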
Decentralized Administration



Owner fields in PGOitems and registry properties allow mutually 
suspicious system administrators to securely partition the admini- 
stration of the RGY database. An owner field defines who can 
update the corresponding record. 



Access Controls 



The RGY database maintains certain access controls when updating 
information in the database. Only the registry owner, stored in the 
registry properties, can update the registry properties or policy. The 
registry properties also contain owner records for each of the nam- 
ing domains. Only the owner of each naming domain can create 
new PGOitems in that domain. 



When a PGOitem is created, it is assigned an owner. All future 
manipulations of that PGOitem can only be performed by the 
owner. Group and organization membership lists can only be ma- 
nipulated by the owner of the group or organization PGOitem. 



Accounts can be created only by the owner of the corresponding 
person PGOitem. To have an account that is affiliated with a spe- 
cific group and organization, a person must first be a member of the 
corresponding group and organization. Update of the administra- 
tive portion of accounts is reserved to the owner of the correspond- 
ing person PGOitem; updates to the user portion of an account can 
only be performed by the corresponding person, or by the owner of 
the corresponding person PGOitem. If a person is deleted from a 
group or organization membership list, then any accounts that may 
exist for that person in the group or organization are also deleted. 
Group and organization membership as a pre-condition for account 
existence is an invariant that is maintained by the RGY database. 
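
The membership invariant can be illustrated with a toy registry that
tracks only membership lists and accounts. Ownership checks are
deliberately left out, and the class and method names are hypothetical;
this is a sketch of the rule, not of the RGY implementation.

```python
class Registry:
    """Toy registry that keeps only memberships and accounts."""

    def __init__(self):
        self.group_members = {}   # group name -> set of person names
        self.org_members = {}     # organization name -> set of person names
        self.accounts = set()     # (person, group, org) triples

    def add_member(self, table, name, person):
        table.setdefault(name, set()).add(person)

    def add_account(self, person, group, org):
        # Membership in both the group and the organization is a
        # pre-condition for the account to exist.
        if person not in self.group_members.get(group, set()):
            raise ValueError(f"{person} is not a member of group {group}")
        if person not in self.org_members.get(org, set()):
            raise ValueError(f"{person} is not a member of organization {org}")
        self.accounts.add((person, group, org))

    def remove_from_group(self, person, group):
        self.group_members.get(group, set()).discard(person)
        # Deleting the membership also deletes the dependent accounts.
        self.accounts = {a for a in self.accounts
                         if (a[0], a[1]) != (person, group)}

    def remove_from_org(self, person, org):
        self.org_members.get(org, set()).discard(person)
        self.accounts = {a for a in self.accounts
                         if (a[0], a[2]) != (person, org)}
```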
The Representation of Owners



An owner field is represented by a SID where any constituent field 
may be replaced by a wildcard (represented by the character %). 
Owner permissions are granted to anyone who can login with an 
account that has a SID that matches the owner SID. If the owner 
SID contains a wildcard for one of the constituent fields, then any 
value for that field will match. In a free-wheeling system, all the 
owner fields could be set to %.%.%, thus allowing anyone to ma- 
nipulate all the data in the RGY database. Owner records of the 
form %.rgy_admin.% would grant permission to anyone logged in 
with an account that had rgy_admin as the group portion of its 
SID. 
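
The wildcard rule is simple enough to state as code. In this sketch
SIDs are written as person.group.org name strings for readability;
that is an assumption, since a real SID is a triplet of UIDs, and the
function name is hypothetical.

```python
WILDCARD = "%"


def sid_matches(owner: str, sid: str) -> bool:
    """True if a login SID satisfies an owner SID that may hold wildcards."""
    owner_fields = owner.split(".")
    sid_fields = sid.split(".")
    if len(owner_fields) != 3 or len(sid_fields) != 3:
        raise ValueError("expected person.group.org")
    # Each owner field must be a wildcard or equal the login's field.
    return all(o == WILDCARD or o == s
               for o, s in zip(owner_fields, sid_fields))
```

For example, the owner %.rgy_admin.% matches jones.rgy_admin.sales
but not jones.sys.sales, and %.%.% matches every SID.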



If all owner fields in the RGY database are the same, then update 
access to the RGY database is comparable to update access to the 
/etc/passwd file. When the RGY database contains a large number 
of people, however, it is more likely that each user community will 
have the PGOitems corresponding to its members owned by an ad- 
ministrator from within that user community. 



Example 



For the purposes of this example, we will assume that the network 
and machines in the federation are secure. The only security risk 
we are concerned with is the access to or corruption of data by a 
person with a valid but unauthorized account. 



In a large corporation, a small group of researchers are working on 
a sensitive project called the manhattan project. To protect the 
confidentiality of their work, they have protected their files so that 
access is limited to members of the manhattan group. The re- 
searchers could disconnect their machines from the corporate fed- 
eration and ensure their security, but to do so would unduly disrupt 
their work. How do they guarantee that no one outside the group
acquires an account with access to their data?
The first step is to assign the ownership of the manhattan group
PGOitem to a member of the manhattan group (e.g.,
teller.manhattan.research). This will guarantee that only the user teller, who
is a member of the manhattan group, can add or delete members
from the group. For most situations this will be sufficient, but for
truly security conscious environments more must be done. 



Assume that the user fermi is a member of the manhattan group. 
Further assume that the owner of fermi's person PGOitem is
malicious.spy.%. It would be a simple matter for malicious to change
the password on fermi's account and thus compromise the man- 
hattan group's data. To be fully secure, the administrator of the 
manhattan group (teller) should allow people to be members of 
the group only if he is also the owner of their person PGOitems. 



Structured Editing 



The edrgy tool is used to manage the naming, account, and policy 
information in the RGY database. It is an interactive editor that 
provides users and system administrators with a structured interface 
to the user registration system, at once ensuring consistency, se- 
mantic correctness, and timely availability of changes. 



Edrgy is aware of the semantic constraints placed upon the con- 
tents of the RGY database, and of the policy that is in effect. Edrgy 
uses this knowledge to assist the system administrator in performing 
semantically correct operations. For example, if the system admin- 
istrator attempts to add an account for a person that does not be- 
long to the requested group, and the administrator has rights to 
update the group, then edrgy will first add the person to the group 
and then add the account. Warnings are given before an operation 
is performed if that operation may have side effects. For example, 
edrgy will warn that the deletion of a group will also delete any 
accounts that exist with that group's permissions. 






System Structure 



The RGY user registration system is composed of two distinct por- 
tions: the database, which is an NCS replicated object, and the 
client agent which provides RGY access for the host environment. 



RGYD: the RGY Daemon 



RGYD is the NCS server (process) that exports remote interfaces to 
the RGY database. Three classes of interfaces are exported by the 
RGYD: database queries and updates, replica control, and database 
update propagation. 



Database Operations 



Database operations involve the maintenance and use of the RGY 
database. RGYD exports interfaces to query and update all RGY 
structures directly. In addition, the RGYDs maintain a set of inter- 
faces presenting a view of the database that is equivalent to the view 
presented by the getpwent(3) and getgrent(3) C library functions. 
By extracting information from the corresponding PGOitem and ac- 
count records RGYD constructs password file entries. Group and 
org file records are constructed from the corresponding PGOitem 
and membership lists. 
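
As a rough sketch of that construction, the function below assembles
a conventional colon-separated password file line from the unix ids
held in the person and group PGOitems and the user portion of an
account. The dictionary keys and the function name are hypothetical.

```python
def passwd_entry(login, person_unix_id, group_unix_id, account):
    """Build a password file line in the usual seven-field layout.

    `account` is a dict holding the user portion of an account record;
    its keys are assumptions made for this illustration.
    """
    return ":".join([
        login,
        account["password"],      # already-encrypted password field
        str(person_unix_id),      # unix id from the person PGOitem
        str(group_unix_id),       # unix id from the group PGOitem
        account["gecos"],
        account["home"],
        account["shell"],
    ])
```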



The RGY database is kept in virtual memory as a forest of balanced
binary trees [1] yielding efficient (O(log n) operations where n is
the number of items in each relation) query and update access.
Deleted items are marked and left in the trees until garbage collec- 
tion is performed during a checkpoint. Updates are first applied to 
the in-memory data structures and are then atomically recorded in
a stable storage log. Checkpoints of the in-memory data structures 
are taken every few hours for each relation that has been modified 
since the last checkpoint. The RGYD automatically recovers the 
state of the RGY database after a system crash by reloading the last 
checkpoint state and then re-executing each operation recorded in
the stable storage log. 
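
The log-and-checkpoint cycle described above can be sketched as
follows. This is an illustration only: a Python dict stands in for
the balanced trees, JSON files stand in for the stable storage
formats, and every name is hypothetical.

```python
import json
import os


class Store:
    """In-memory table with an append-only log and periodic checkpoints."""

    def __init__(self, directory):
        self.ckpt = os.path.join(directory, "checkpoint.json")
        self.log = os.path.join(directory, "update.log")
        self.data = {}
        self.recover()

    def _apply(self, op):
        if op["kind"] == "put":
            self.data[op["key"]] = op["value"]
        else:  # "delete"
            self.data.pop(op["key"], None)

    def update(self, op):
        # Apply to the in-memory structure first, then force the
        # operation to the log before reporting success.
        self._apply(op)
        with open(self.log, "a") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def checkpoint(self):
        # Write the new checkpoint beside the old one, then swap it in
        # atomically and truncate the log. Replaying a stale log over a
        # newer checkpoint is harmless because put/delete are idempotent.
        tmp = self.ckpt + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt)
        open(self.log, "w").close()

    def recover(self):
        # Reload the last checkpoint, then re-execute the logged operations.
        self.data = {}
        if os.path.exists(self.ckpt):
            with open(self.ckpt) as f:
                self.data = json.load(f)
        if os.path.exists(self.log):
            with open(self.log) as f:
                for line in f:
                    self._apply(json.loads(line))
```

Constructing a second Store over the same directory simulates a
restart after a crash: it sees the checkpointed state plus every
update logged since.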



Not all UNIX programs access the password file structures through 
the procedural interfaces provided by the C library. To accommo- 
date these programs, the RGYD maintains ASCII file versions of 
the password, group, and org files. These files are recreated at each
checkpoint if the data in these views has been modified. 



Replica Management 



A collection of RGYD processes spread across a number of hosts 
cooperate to maintain a weakly consistent replicated database. Up- 
dates do not occur at all RGYDs simultaneously; instead, one of the 
RGYDs is selected to serve as the master site and becomes the only 
daemon that accepts database updates. The master RGYD then as- 
sumes the responsibility of propagating each update to the other 
cooperating (slave) RGYDs. 



RGYD sites may come, go, and move around with ease. When a 
slave RGYD first starts running, it locates the master RGYD through 
the NCS Global Location Broker and announces its existence. If 
this RGYD is a new site, then the master RGYD will initialize the 
slave and record an operation to inform all other slaves of its exis- 
tence. If the new RGYD is an existing site that has moved to a new 
address, the master RGYD will record this change of address and 
inform the other replicas. In this way each RGYD maintains a cur- 
rent copy of the replica list. 



A special tool, rgy_admin, is used to remotely inspect and control 
each RGYD. All operations that affect the state of a RGYD are 
reserved to the owner of the registry database as recorded in the 
RGY properties data. With the rgy_admin tool, the registry owner 
can determine if replicas are out of date, cause a replica site to be 
reinitialized, select a new master site and decommission a RGYD 
site. When a RGYD site receives the decommission request, it 
purges its database and terminates execution. 
Update Propagation



In addition to managing the database and replica list, the master 
RGYD also manages a propagation queue. The role of the propaga- 
tion queue is analogous to the stable storage log. Every update 
operation performed at the master RGYD is recorded in the propa- 
gation queue for later application at the slave RGYDs. In practice, 
the propagation queue and the stable storage log are the same struc- 
ture. Slave RGYDs are free to truncate the stable storage log once 
a checkpoint has completed, but the master RGYD must preserve 
the portion of the log that remains to be propagated to each slave. 
Update propagations, like all other remote operations, are accomplished
through the use of remote procedure calls.



A simple protocol between the master and slaves ensures that up- 
dates are processed in serial order. The master RGYD applies a 
monotonically increasing timestamp to each update it records. 
When propagating an update, the master RGYD transmits the previ- 
ous update timestamp as well as the current update timestamp. 
Retransmitted updates are simply ignored by the slaves, but if the 
slave detects that it is out of date with respect to the previous up- 
date it requests to be reinitialized. 
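
The slave's side of this protocol might look like the sketch below.
Integer timestamps, the class name, and the string return values are
assumptions for illustration; the paper does not specify the RPC
signatures.

```python
class Slave:
    """Applies propagated updates in strict serial order."""

    def __init__(self, last_ts=0):
        self.last_ts = last_ts   # timestamp of the last update applied
        self.applied = []

    def receive(self, prev_ts, ts, update):
        if ts <= self.last_ts:
            # Already seen: a retransmission, so ignore it.
            return "ignored"
        if prev_ts != self.last_ts:
            # The master's previous update is not the one we hold, so at
            # least one update was missed: ask to be reinitialized.
            return "reinitialize"
        self.applied.append(update)
        self.last_ts = ts
        return "applied"
```

Sending the previous timestamp with each update lets the slave detect
a gap without the master having to track per-slave sequence state in
the message itself.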



The master RGYD periodically retransmits an update to a slave 
which is unreachable. As the number of attempts to reach the slave 
increases, the time interval between retransmissions is also in- 
creased. Eventually the master will mark a slave as out of touch and 
will reinitialize the slave when it finally becomes reachable.
Database initialization is accomplished through bulk transfer of the 
database state to the target slave RGYD. Thus the master may 
purge updates from its propagation queue even when some slaves 
are unreachable for long periods of time. 
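
One way to picture the lengthening retry interval is a geometric
schedule with a cutoff. The paper gives no figures, so the starting
interval, the doubling factor, and the attempt limit below are all
invented for illustration.

```python
def retry_schedule(base_seconds=60, factor=2, max_attempts=5):
    """Yield the wait before each retransmission to an unreachable slave.

    When the generator is exhausted, the caller would mark the slave
    out of touch and drop its pending updates from the queue.
    All default values are assumptions, not figures from the paper.
    """
    delay = base_seconds
    for _ in range(max_attempts):
        yield delay
        delay *= factor
```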



The RGY Client Agent 



The RGY Client Agent (RCA) is divided into two components. The 
most primitive level consists of the automatically generated client 
side RPC stubs for the registry operations and code for binding to a 
RGYD. The next, optional layer of the client agent (RGYC) provides
registry services in the event of network failure. This layer can
be used for translating credentials in a heterogeneous environment 
or for filtering data from the network registry. The first time a RGY 
operation is performed, the RCA contacts the NCS Global Location 
Broker to randomly select a RGYD site for the operation. Subse- 
quent operations are directed to the same RGYD until that server 
becomes unavailable. To allow the client agent to remain unaware 
of the replication strategy chosen by the RGYDs, we have divided 
RGY operations into separate query and update interfaces. The 
RCA actually maintains two bindings, one for queries and the other 
for updates. With the master/slave replication currently imple- 
mented by the RGYDs, only one server will register the update in- 
terface at a time. An explicit binding operation is provided by the 
RCA for registry editing applications. This allows the application to 
force queries and updates to be delivered to the same RGYD. 



The Local Registry 



The local registry, maintained on each node by the RGYC, provides 
a cache of user registration data in the event that a registry server is 
not available. This cache of recently used accounts supports que- 
ries for login and C library calls (getpwent(3), getgrent(3)). In 
order to prevent the cache contents from becoming stale, each re- 
mote operation returns the timestamp of the last operation that may 
have invalidated the cache. If the cache is out of date, the client 
agent initiates a cache refresh operation. 
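
The timestamp check can be sketched as below. The fetch_all callable,
which returns the cached entries together with the server's current
update timestamp, is a hypothetical stand-in for the real refresh
operation, and refreshing the whole cache at once is a simplification.

```python
class LocalRegistry:
    """Cache of registry entries, refreshed when a reply shows it is stale."""

    def __init__(self, fetch_all):
        # fetch_all() -> (dict of entries, timestamp of last update);
        # a hypothetical stand-in for the remote refresh call.
        self._fetch_all = fetch_all
        self.entries, self.cached_ts = fetch_all()

    def note_reply(self, last_update_ts):
        """Call with the timestamp carried by every remote reply."""
        if last_update_ts > self.cached_ts:
            self.entries, self.cached_ts = self._fetch_all()

    def lookup(self, name):
        # Answers come from the cache, so they remain available even
        # when no registry server can be reached.
        return self.entries.get(name)
```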



Authentication 



Ignoring well-known security holes in UNIX systems, it can be said 
that access to files is vigorously protected by the operating system. 
When deciding to grant access to a file, the kernel is free to believe 
the identity information that it has stored for the process. In a 
network computing environment, however, there is no reason for 
one host to believe that another host has not been compromised. 
An application cannot even be sure that network messages truly 
originated with the host listed in the message. 
The traditional mechanism for proving identity is the use of a secret
that is known only to the two principals, the person claiming the 
identity and the guardian of the resource. In a network environ- 
ment where principals reside on different hosts, encryption must be 
used when exchanging the secret in order to maintain the secret. 
Our model for network authentication is inspired by Needham and 
Schroeder's work [8]. 



All RGYD update and replica administration operations that apply 
access controls perform authentication as the first step in those con- 
trols. To perform authentication, two encryption keys are associ- 
ated with each account: a login key which is constructed from the 
plain text of the account's login password, and a master key which 
is generated by the RGYD when the account is created. When an 
update request is received by the RGYD, it constructs a random bit 
pattern and encrypts it with the requester's master key. The RGYD 
then makes an RPC callback to the initial requester. This callback 
is a challenge that requires the requester to decrypt the message, 
perform a function on the bit pattern, and return the encrypted 
result. 
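
The shape of the challenge can be shown with a short sketch. Note the
substitution: instead of the encrypt, transform, and re-encrypt
exchange described above, this example uses a keyed hash (HMAC-SHA-256)
over a random challenge, because Python's standard library has no
symmetric cipher. It demonstrates the same property, that only a holder
of the master key can answer, but it is not the RGYD protocol, and the
function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_challenge() -> bytes:
    """Server side: a fresh random bit pattern for each request."""
    return secrets.token_bytes(16)


def respond(master_key: bytes, challenge: bytes) -> bytes:
    """Requester side: prove possession of the master key."""
    return hmac.new(master_key, challenge, hashlib.sha256).digest()


def verify(master_key: bytes, challenge: bytes, response: bytes) -> bool:
    """Server side: recompute the expected answer and compare in constant time."""
    return hmac.compare_digest(respond(master_key, challenge), response)
```

Because the challenge is random and used once, a response captured
from an earlier exchange is of no use for a later one.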



To successfully meet the authentication challenge posed by the 
RGYD, the requesting process must possess the valid master key for 
the claimed identity. The login (/bin/login) and set user id (/bin/ 
su) programs have been modified to acquire the valid master key. 
Rather than using the standard getpwent(3) calls to retrieve the 
password file record, these programs now make a direct RCA call 
that retrieves the required information as well as the master key. 
To protect the master key, it is never transmitted in the clear. It is 
first encrypted with the login key for the account. In this way the 
master key will be useful only if the login program possesses the 
valid password for the account. If the password is not known, the 
master key will not be decrypted properly. 
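
The idea of protecting the master key with the login key can be
sketched as follows. Everything here is a stand-in chosen so the
example runs with the standard library alone: the login key is derived
with PBKDF2, and the "encryption" is an XOR of the master key with
that derived key. The actual key derivation and cipher used by RGY
are not described in the paper.

```python
import hashlib

KEY_LEN = 32  # bytes; an assumption for this sketch


def login_key(password: str, salt: bytes) -> bytes:
    """Derive a login key from the plain text of the password."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt,
                               100_000, dklen=KEY_LEN)


def wrap(master_key: bytes, password: str, salt: bytes) -> bytes:
    """XOR the master key with the login key (illustration only)."""
    if len(master_key) != KEY_LEN:
        raise ValueError("master key must be KEY_LEN bytes")
    pad = login_key(password, salt)
    return bytes(a ^ b for a, b in zip(master_key, pad))


# XOR is its own inverse, so unwrapping is the same operation.
unwrap = wrap
```

Unwrapping with the wrong password yields bytes that look just as
random as the real key, so the failure only shows up when the
challenge is answered incorrectly.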



The mechanism described above is used to guarantee that the RGY 
database is never modified by unauthorized users. In a security 
conscious environment, it is also necessary to verify that the client is 
connected to a legitimate RGYD. Mechanisms to accomplish this 
task are inherent in the system, but are beyond the scope of this 
paper. 
Conclusions



The RGY system is currently used by about 100 personal worksta- 
tions on the Apollo corporate internet. The RGY database contains 
about 2500 users, 100 groups and 50 organizations. On an average 
day about 150,000 database operations are performed spread out 
over the 4 RGYDs maintaining the replicated database. While these 
numbers are small compared to our design goals, we are encour- 
aged to see that we are not yet close to saturating the capacity of a 
single RGYD even when only one server is running. 



We are currently investigating new replication algorithms that will 
allow us to perform updates at any RGYD site, rather than at only 
the master RGYD site. These algorithms maintain the semantic 
invariants in the database, and will improve update availability in 
the face of network failures. 
The Network Computing Architecture and System:

An Environment for Developing 

Distributed Applications 

by 

Terence H. Dineen, Paul J. Leach, Nathaniel W. Mishkin, 

Joseph N. Pato, Geoffrey L. Wyant 



The Network Computing Architecture (NCA) is an object-oriented 
framework for developing distributed applications. The Network 
Computing System (NCS) is a portable implementation of that ar- 
chitecture that runs on UNIX and other systems, including Domain/ 
OS. By adopting an object-oriented approach, we encourage appli- 
cation designers to think in terms of what they want their applica- 
tions to operate on, not what server they want the applications to 
make calls to or how those calls are implemented. This design in- 
creases robustness and flexibility in a changing environment. 



Introduction 



NCS currently runs under Apollo's Domain/IX [7], Domain/OS, 
4.2BSD and 4.3BSD, and Sun's version of UNIX. Implementations 
are currently in progress for the IBM PC and VAX/VMS. Apollo 
Computer has placed NCA in the public domain. 

In addition to its object orientation, some interesting features of the 
system are as follows. It supplies a transport-independent remote 
procedure call (RPC) facility using BSD sockets as the interface to 
any datagram facility. It provides at-most-once semantics over the 
datagram layer, with optimizations if an operation is declared to be 
idempotent. It is built on top of a concurrent programming support 
package that provides multiple threads of execution in a single ad- 
dress space, although versions can be made for machines that just 
have asynchronous timer interrupts. 



Copyright © 1987 Apollo Computer, Inc. Unpublished, all rights
reserved. 
The data representation supports multiple scalar data formats, so
that similar machines do not have to convert data to a canonical 
form, but can instead use their common data formats. The RPC 
interface definition compiler is extensible. Procedures to do the cli- 
ent/server binding can be attached to data types defined in the in- 
terface. Also, complex data types can be marshalled by user-sup- 
plied procedures which convert such types to data types the com- 
piler understands. There is a replicated global location database: 
Using it, the locations of an object can be determined given its ob- 
ject ID, its type, or one of its supported interfaces. 

There are several motivations for NCA. Large, heterogeneous net- 
works are becoming more common. Users of systems in such net- 
works are often frustrated by the fact that they can't get those sys- 
tems to work cooperatively. Over the last few years, advances have 
been made in allowing data sharing to occur between the systems, 
but not compute sharing. Tools to allow the effective use of the 
aggregate compute power have not been available. The inability to 
share computing resources has become even more aggravating as 
more specialized processors (e.g., ones designed to run numerical 
applications fast) have become more widespread. Current "technol- 
ogy" obliges users of those processors to resort to FTP and Telnet. 
Even in an environment of systems of relatively similar power, a 
network computing architecture is called for: There are applications 
that can take advantage of many systems in parallel. (Parallel 
"make" is the most obvious example.) Also, replicating resources 
over a number of machines increases the reliability seen by users of 
the network. 

It is important to understand that there is almost no "network appli- 
cation" that can't be implemented without NCA/NCS. However, 
the implementation is bound to be more difficult, less general, and 
harder to install on a variety of systems. Further, experience has 
shown that some obviously useful network applications simply don't 
get written because of these problems. The existence of NCA/NCS
helps to solve these problems and, as a result, expands the set of
network applications.

Architecture

The figure illustrates NCA's overall structure.

NCA's Overall Structure



Heterogeneous Interconnect 

The lowest level provides the basic interconnection to heterogeneous
computing systems. At this layer NCA currently defines a remote
procedure call protocol (NCA/RPC), a Network Interface Defini-
tion Language (NIDL), and a Network Data Representation
(NDR). RPC is a mechanism that allows programs to make calls to
subroutines where the caller and the subroutine run in different
processes, most commonly on different machines. The RPC ap-
proach and an implementation similar to ours are described in detail
by Birrell and Nelson [2]. NIDL is a high-level language used to
specify the interfaces to procedures that are to be invoked through
the RPC mechanism. NCS includes a portable NIDL compiler that
takes NIDL interfaces as input and produces stub procedures that,
among other things, handle data representation issues and connect
program calls to the NCS RPC runtime environment that imple-
ments the NCA/RPC protocol. The relationships among the client
(i.e. the caller of a remote procedure), server, stubs, and NCS
runtime are shown in the following figure.

Relationships Among Client, Server, Stubs and NCS Runtime



Server Support Tools 

Augmenting the heterogeneous interconnect layer are the server sup-
port tools. These tools simplify the writing of complex applications
in a distributed environment. Currently these consist of the Data
Replication Manager (DRM) and Concurrent Programming Support
(CPS). DRM provides a weakly consistent, replicated database fa-
cility. It is useful for providing replicated objects when high avail-
ability is important and weak consistency can be tolerated. CPS
provides integrated lightweight tasking facilities. CPS allows multi- 
threaded servers to be written easily. 



Brokers, Clients, Servers and User Interfaces 

Built on top of the server-support tools is a set of brokers. A
broker is a third-party agent that facilitates transactions between
principals. In a network computing environment brokers are pri-
marily useful in determining object locations, but can also be used
for establishing secure communications (i.e., authentication), asso-
ciatively selecting objects, issuing software licenses, and a variety of
other administrative chores not directly related to the operation of
the principals. The role of brokers is shown in the next figure.

The Role of Brokers in NCA

Client programs and application servers make use of the three base 
layers. Application servers are the "producers" of services and cli- 
ents the "consumers." Servers invoke brokers to make their exis- 
tence known. Clients can invoke brokers to locate application serv- 
ers and then use the underlying RPC mechanism to make use of the 
services provided. The application server may be in turn a client of 
other distributed services. 

From the user's perspective, user interfaces tie all the pieces together.
However, user interfaces are not part of NCA and will not be dis-
cussed in this paper.

Unique Identifiers



An important aspect of NCA is its use of universal unique identi- 
fiers (UUIDs) as the most primitive means of identifying NCA enti- 
ties (e.g., objects, interfaces, operations). UUIDs are an extension 
of the unique identifiers (UIDs) already used throughout Apollo's 
system [6]. Both UIDs and UUIDs are fixed-length identifiers that 
are guaranteed to refer to just one thing for all time. The principal 
advantages of using any kind of unique identifiers over using string 
names at the lowest level of the system include: small size, ease of 
embedding in data structures, location transparency, and the ability 
to layer various naming strategies on top of the primitive naming 
mechanism. Also, identifiers can be generated anywhere, without 
first having to contact some other agent (e.g., a special server on 
the network, or a human representative of a company that hands
out identifiers).

UIDs are 64 bits long and are guaranteed to be unique across all 
Apollo systems by embedding in them the node number of the sys- 
tem that generated the UID and the time on that system that the 
UID was generated. To make it possible to generate unique identifi-
ers on non-Apollo systems, we defined UUIDs to be 128 bits and
made the encoding of the identity of the system that generates the 
UUID more flexible. 
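The idea of building an identifier from purely local information can be sketched in C. This is only an illustration: the field layout (a 64-bit timestamp, a one-byte address family, and seven bytes of host identity), the type name sketch_uuid, and both function names are assumptions made for the example and are not taken from this paper.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit identifier. The field layout is an assumption
   made for this sketch; the paper does not specify it. */
typedef struct {
    uint64_t timestamp;  /* time on the generating system (caller-supplied) */
    uint8_t  family;     /* says how to interpret the host bytes */
    uint8_t  host[7];    /* identity of the generating system */
} sketch_uuid;

/* Build an identifier from local information only: no server is contacted. */
sketch_uuid sketch_uuid_make(uint64_t now, uint8_t family,
                             const uint8_t *host, size_t host_len)
{
    sketch_uuid u;
    memset(&u, 0, sizeof u);
    u.timestamp = now;
    u.family = family;
    if (host_len > sizeof u.host)
        host_len = sizeof u.host;
    memcpy(u.host, host, host_len);
    return u;
}

/* Two identifiers are equal only if time, family, and host all match. */
int sketch_uuid_equal(const sketch_uuid *a, const sketch_uuid *b)
{
    return a->timestamp == b->timestamp && a->family == b->family &&
           memcmp(a->host, b->host, sizeof a->host) == 0;
}
```

Two systems that generate an identifier at the same instant still produce different values, because their host bytes differ. The family byte is what would let non-Apollo systems encode their own kind of host identity.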

The remainder of this paper discusses several aspects of NCA and 
NCS: NCA's object-oriented approach; NIDL; NDR; the NIDL 
compiler; the Location Broker used in connecting clients with serv- 
ers; and the networking model and protocol used by NCS. We con- 
clude with a description of future directions we expect NCA and 
NCS to follow. 



The Object-Oriented Approach 

NCA is object-oriented. By this we mean that it follows a paradigm 
established by systems such as Smalltalk [4], Eden [1, 5], and 
Hydra [12, 3]. The basic entity in an object-oriented system is the 
object. An object is a container of state (i.e. data) that can be
accessed and modified only through a well-defined set of opera-
tions (what Smalltalk calls messages).
The implementation of the operations is completely hidden from
the client (i.e. caller) of the operations. Every object has some
type (what Smalltalk calls a class). The implementation of a set of
operations is called a manager (what Smalltalk calls a set of meth-
ods). Only the manager of a type knows the internal structure of
objects of the type it manages. Sets of related operations are 
grouped into interfaces. Several types may support the same inter- 
face; a single type may support multiple interfaces. 

For example, consider an interface called directory containing the 
operations add_entry, drop_entry, and list_entries. This interface
might be supported by two types: directory_of_files and
print_queue. There are potentially many objects of these two
types. That there are many objects of the type directory_of_files
should be obvious. By saying that there are many print_queue ob-
jects we mean that a system (or a network of connected systems)
might have many print queues — say, one for each department in a 
large organization. 



Motivation 



The reason for using the object-oriented approach in the context of 
a network architecture is that this approach lets you concentrate on 
what you want done, instead of where it's going to be done and how 
it's going to be done: objects are the units of distribution, abstrac- 
tion, extension, reconfiguration, and reliability. 

Distribution. Distribution addresses the question of where an op- 
eration is performed. The answer to this question is that the opera- 
tion is performed where the object resides. For example, if the 
print queue lives on system A, then an attempt to add an entry to 
the queue from system B must be implemented by making a remote 
procedure call from system B to system A. (This implementation 
fact is hidden from the program attempting to add the entry.)

Abstraction. Abstraction addresses the question of how an opera-
tion is performed. In NCA, the object's type manager knows how
the operation is performed. For example, a single program,
list_directory, could be used to list both the contents of a file system
directory and the contents of a print queue. The program simply calls
the list_entries operation. The type managers for the two types of
objects might represent their information in completely different
ways (because, say, of the different performance characteristics re-
quired). However, the list_directory program uses only the ab-
stract operation and is insulated from the details of a particular
type's implementation. 
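The separation between an interface and its managers can be sketched in C as a table of function pointers. This is a local, single-process illustration only, with no RPC involved. The names directory_if, object_ref, file_dir, and the two manager tables are hypothetical, and drop_entry is omitted for brevity.

```c
#include <stddef.h>

/* The "directory" interface: operations only, no representation. */
typedef struct {
    int    (*add_entry)(void *obj, const char *name);
    size_t (*list_entries)(const void *obj, const char **out, size_t max);
} directory_if;

/* An object reference pairs the state with the manager for its type. */
typedef struct {
    const directory_if *ops;
    void *state;
} object_ref;

/* Manager 1: a fixed array (stands in for directory_of_files). */
typedef struct { const char *names[8]; size_t count; } file_dir;

static int file_dir_add(void *obj, const char *name)
{
    file_dir *d = obj;
    if (d->count == 8) return -1;
    d->names[d->count++] = name;
    return 0;
}

static size_t file_dir_list(const void *obj, const char **out, size_t max)
{
    const file_dir *d = obj;
    size_t n = d->count < max ? d->count : max;
    for (size_t i = 0; i < n; i++) out[i] = d->names[i];
    return n;
}

const directory_if file_dir_manager = { file_dir_add, file_dir_list };

/* Manager 2: a linked list (stands in for print_queue). */
typedef struct job { const char *name; struct job *next; } job;
typedef struct { job *head; job *tail; job pool[8]; size_t used; } print_queue;

static int print_queue_add(void *obj, const char *name)
{
    print_queue *q = obj;
    if (q->used == 8) return -1;
    job *j = &q->pool[q->used++];
    j->name = name;
    j->next = NULL;
    if (q->tail) q->tail->next = j; else q->head = j;
    q->tail = j;
    return 0;
}

static size_t print_queue_list(const void *obj, const char **out, size_t max)
{
    const print_queue *q = obj;
    size_t n = 0;
    for (const job *j = q->head; j && n < max; j = j->next) out[n++] = j->name;
    return n;
}

const directory_if print_queue_manager = { print_queue_add, print_queue_list };

/* list_directory: a client that knows only the interface. */
size_t list_directory(object_ref o, const char **out, size_t max)
{
    return o.ops->list_entries(o.state, out, max);
}
```

The list_directory function never looks inside file_dir or print_queue, so a third type that supplies its own directory_if table would work with it unchanged.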

Extension. The object-oriented approach allows extension; i.e. it 
specifies how the system is enhanced. In NCA, there are two kinds 
of extensions allowed. The first is extension by creation of new 
types. For example, users can create new types of objects that sup- 
port the directory interface; programs like list_directory that are 
clients of this interface simply work on objects of the new type, 
without modification. The second kind of extension is extension by 
creation of new interfaces. A new interface is the expression of new 
functionality. 

Reconfiguration. Because of partial failures, or for load balancing, 
networked systems sometimes need to be reconfigured. In object- 
oriented terms, this reconfiguration takes place by moving objects 
to new locations. For example, if the system that was the home for 
some print queue failed because of a hardware problem, the system 
would be reconfigured by moving the print queue object to a new 
system (and informing the network of the object's new location). 

Reliability. The availability of many systems in a network should 
result in increased reliability. NCA's approach is to foster increased 
reliability by allowing objects to be replicated. Replication increases 
the probability that at least one copy of the object will be available to
users of the object. To make replication feasible, NCS provides 
tools to keep multiple replicas of an object in sync. 

While NCA is object-oriented and we believe that applications that 
use the object-oriented capabilities of NCA will be more robust and 
general than those that don't, it is easy to use NCS as a conven-
tional RPC system, ignoring its object-oriented features.

Network Interface Definition Language

The Network Interface Definition Language (NIDL) is the language 
used in the Network Computing Architecture to describe the re- 
mote interfaces called by clients and provided by servers. Interfaces 
described in NIDL are checked and translated by the NIDL com- 
piler. 

NIDL is strictly a declarative language — it has no executable con- 
structs. NIDL contains only constructs for defining the constants, 
types, and operations of an interface. NIDL is more than an inter-
face definition language, however. It is also a network interface defi-
nition language and, therefore, it enforces the restrictions inherent
in a distributed computing model (e.g. lack of shared memory). 



NIDL Language Constructs 

A NIDL interface contains a header, constant and type defini-
tions, and operation descriptions. The header provides the inter-
face identification: its UUID, name, and version number. The
UUID is the "name" by which an interface is known within NCA. It
is similar to the program number in other RPC systems, except that
it is not centrally assigned. The interface name is a string name for 
the interface which is used by the NIDL compiler in naming certain 
publicly known variables. The version number is used to support 
compatible enhancements of interfaces. 

A standard set of programming language types is provided. Integers 
(signed and unsigned) come in one, two, four, and eight byte sizes. 
Single (four-byte) and double (eight-byte) precision floating-point 
numbers are available. Other scalars include signed and unsigned 
characters, as well as booleans and enumerations. 

In addition to scalar types, NIDL provides the usual type construc-
tors: structures, unions, pointers, and arrays. Unions must be dis-
criminated. (Non-discriminated unions are not permitted. The ac-
tual data values must be known at runtime so that they can be cor-
rectly transmitted to the remote server.) Pointers, in general, are
restricted to being "top-level." That is, pointers to other pointers, 
or records containing pointers are not permitted. Later, we'll see 
how this restriction can be relaxed. Arrays can be fixed in size or 
have their size determined at runtime.

Operation declarations are the heart of a remote interface defini-
tion. These define the procedures and functions that servers imple-
ment and to which clients make calls. All operations are strongly
typed. This enables the NIDL compiler to generate the code to 
correctly copy parameters to and from the packet and to do any 
needed data conversions. Operation declarations can be optionally 
marked to have certain semantic properties, for example whether 
they are idempotent. (An idempotent procedure is one that can be 
executed many times with no ill-effect.) 

All operations are required to have a handle as their first parame- 
ter. This parameter is similar to the implicit "self" argument of 
Smalltalk-80 or the "this" argument of C++ [9]. The handle argu- 
ment is used to determine what object and server is to receive the 
remote call. NIDL defines a primitive handle type named
handle_t. An argument of this type can be used as an operation's han-
dle parameter. Clients can obtain a handle_t by calling the NCS
runtime, providing an object UUID and network location as input 
arguments. Use of more abstract kinds of handles is described be- 
low. 

Handle arguments can be implicit. An interface definition can de- 
clare that a single global variable should be treated as the handle 
argument for all operations in the interface. While this style con- 
flicts with some of the goals of the object-oriented approach (e.g., it 
makes it harder to make calls on different objects using the same 
interface), it can be useful in cases where an existing local interface
is being converted to work remotely. 



NIDL Example 



The following figure is a short example of an interface described in 
NIDL. The example is of an interface to a bank object that sup- 
ports a single operation: deposit money into an account. 

(1) Defines the UUID by which this interface is known. This is the
first version of this interface. If new operations are added in the
future, the version number should be incremented. (2) Declares
the interfaces upon which this interface is dependent. The import
statement is similar to #include, except that the named interface is
not textually included. The contents are made available for the im-
porter to refer to types and constants defined in that interface. This
allows factoring out a common set of types into a base interface. (3)
Defines a set of types (account and account name types) that are
used by the bank operations. Finally, (4) defines the operation it-
self.

A variant of NIDL that looks Pascal-like (as opposed to the C-like 
version of which the figure is an example) is also available. Regard- 
less of the variant used as input to the NIDL compiler, the output is 
the same. 



[uuid(334033030000.0d.000.00.87.84.00.00.00), version(1)]    (1)
interface bank {
    import "nbase.imp.idl";                                  (2)

    typedef long int bank$acct_t;                            (3)
    typedef char bank$acct_name_t[32];

    void bank$deposit(                                       (4)
        [in]  handle_t    h,
        [in]  bank$acct_t acct,
        [in]  long int    amount,
        [out] status_$t   *status
    );
};

Example Interface 



Object-Oriented Binding 

One drawback of the language as described so far is that all opera-
tions are required to have a primitive handle_t as their first argu-
ment. This means clients need to embed these handles in their pro-
grams, and to manage the binding to servers themselves. We would
like to achieve as much local-remote transparency as possible (i.e., 
to make programs insensitive to the location of the objects upon 
which they operate). Embedding primitive handles in client pro- 
grams destroys much of this transparency. To relieve clients of the 
need to manage these handles, we introduced the notion of object- 
oriented binding. 

Object-oriented binding comes into play when the first parameter to 
an operation is not a handle_t. In this case, the type is taken to
represent some more abstract, client-oriented handle. Since a
handle_t is required to actually make remote calls, some way is
needed to translate the abstract handle into a handle_t. The person
who creates the abstract type is thus obliged to write a procedure to
do the conversion. This procedure is assumed to have the name
type_bind (where type is the type name of the abstract handle) and
is automatically called from stubs when the remote call is made.
You can view the abstract handle as an object (in the Smalltalk 
sense) which supports the bind operation. 

To make this more concrete, we could reformulate the above bank 
example in terms of object-oriented binding. Instead of taking a 
handle_t as its first parameter, bank$deposit could take a bank
name, of type bank$name. The NIDL compiler would generate a
call to bank$name_bind to translate from a bank name to the
primitive handle_t. This routine would probably call upon some 
sort of naming server to look up the bank location. The bind rou- 
tine might also choose to cache location information to make later 
translations faster. 
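A bind routine of this kind might look roughly like the following C sketch. Everything here is hypothetical: sketch_handle stands in for handle_t, bank_name for bank$name (the dollar sign is replaced by an underscore for portable C), and name_server_lookup is a local placeholder for whatever naming server the routine would really consult.

```c
#include <stdio.h>
#include <string.h>

typedef struct { char location[64]; } sketch_handle;  /* stands in for handle_t */
typedef const char *bank_name;                         /* stands in for bank$name */

static int lookups;  /* counts calls to the simulated naming server */

/* Placeholder for a naming-server query; a real one would use the network. */
static void name_server_lookup(bank_name name, char *loc, size_t cap)
{
    lookups++;
    snprintf(loc, cap, "host-of-%s", name);
}

/* Hypothetical bind routine: translate a bank name into a primitive
   handle, remembering the most recent translation in a one-entry cache. */
sketch_handle bank_name_bind(bank_name name)
{
    static char cached_name[64];
    static sketch_handle cached;

    if (cached_name[0] == '\0' || strcmp(cached_name, name) != 0) {
        name_server_lookup(name, cached.location, sizeof cached.location);
        snprintf(cached_name, sizeof cached_name, "%s", name);
    }
    return cached;
}

/* Exposed so a caller can observe how often the cache was missed. */
int bank_lookup_count(void)
{
    return lookups;
}
```

In the scheme described above, the client would pass only the bank name to the operation, and the generated stub would call the bind routine on the client's behalf. A one-entry cache is the simplest possible choice; a real routine would presumably also need a way to discard stale locations.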

Object-oriented binding hides the details of handle binding from 
the client and allows interfaces to be designed in a more abstract, 
client-oriented fashion. This provides a higher level of local-remote 
transparency than other systems which always require the client to 
manage handles or explicitly name the remote host on each call. 



Marshalling Complex Types 

In the section on NIDL language constructs, we stated that pointers 
could not be nested. The reason is that such nesting would require 
the NIDL compiler to generate code to transmit general graph 
structures. However, permitting only top-level, non-nested pointers 
can be a severe limitation in the design of an interface. For exam- 
ple, it excludes passing tree data structures to remote procedures. 

To provide an escape from this restriction, NIDL allows a type to 
have an associated "transmissible" type. The transmissible type is a 
type that the NIDL compiler does know how to marshall. Any type 
that has an associated transmissible type must have a set of proce- 
dures to convert that type to and from its transmissible type. In the 
example of the binary tree, the transmissible type could be an ar- 
ray. The tree$to_xmit_rep procedure would walk the tree to build 
a representation of it in the array, and the tree$from_xmit_rep
procedure would reconstruct the binary tree from the array. 
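The tree conversion can be sketched in C as follows. The array encoding chosen here (a preorder walk that emits 1 followed by the value for each node, and 0 for an empty subtree) is an assumption made for the example; the paper does not say how the array is laid out. The function names use underscores in place of the dollar sign.

```c
#include <stdlib.h>

typedef struct tree { int value; struct tree *left, *right; } tree;

/* Append one int to the array; fails if the array is full. */
static int put(int *out, size_t cap, size_t *n, int v)
{
    if (*n >= cap) return -1;
    out[(*n)++] = v;
    return 0;
}

/* Preorder walk: emit 1,value for a node and 0 for an empty subtree.
   *n must start at 0; returns 0 on success, -1 if the array is too small. */
int tree_to_xmit_rep(const tree *t, int *out, size_t cap, size_t *n)
{
    if (t == NULL)
        return put(out, cap, n, 0);
    if (put(out, cap, n, 1) || put(out, cap, n, t->value))
        return -1;
    if (tree_to_xmit_rep(t->left, out, cap, n))
        return -1;
    return tree_to_xmit_rep(t->right, out, cap, n);
}

/* Rebuild the tree from the array. *pos must start at 0. A NULL result
   means an empty tree (or, in this sketch, a failed allocation). */
tree *tree_from_xmit_rep(const int *in, size_t len, size_t *pos)
{
    if (*pos >= len || in[(*pos)++] == 0)
        return NULL;
    if (*pos >= len)
        return NULL;
    tree *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->value = in[(*pos)++];
    t->left = tree_from_xmit_rep(in, len, pos);
    t->right = tree_from_xmit_rep(in, len, pos);
    return t;
}

/* Release a tree built by tree_from_xmit_rep. */
void tree_free(tree *t)
{
    if (t) {
        tree_free(t->left);
        tree_free(t->right);
        free(t);
    }
}
```

The array contains no pointers, so it is a type that a compiler restricted to top-level pointers could marshall directly.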






Transmissible types may be associated with any type, not just types 
using nested pointers. A bitmap is an example. It may be repre-
sented internally as a fixed size array of integers. Even though the
NIDL compiler is capable of marshalling this, it may be more effi- 
cient to have it transmitted in a run-length encoded (RLE) form. 
So the bitmap type could have an associated RLEBitmap type, and 
a set of procedures for converting to and from the RLE form. 
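A pair of conversion procedures for the bitmap case might look like this C sketch. The run format (a one-byte count followed by a one-byte value) and the names rle_run, bitmap_to_rle, and bitmap_from_rle are assumptions for illustration. The bitmap is treated as an array of bytes rather than integers to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* One run: `count` consecutive bytes that all equal `value`. */
typedef struct { uint8_t count; uint8_t value; } rle_run;

/* Encode a bitmap as runs. Returns the number of runs written, or 0 if
   `out` is too small (an empty bitmap also yields 0). */
size_t bitmap_to_rle(const uint8_t *bits, size_t len, rle_run *out, size_t cap)
{
    size_t n = 0;
    size_t i = 0;
    while (i < len) {
        uint8_t v = bits[i];
        size_t run = 1;
        while (i + run < len && bits[i + run] == v && run < 255)
            run++;
        if (n == cap)
            return 0;
        out[n].count = (uint8_t)run;
        out[n].value = v;
        n++;
        i += run;
    }
    return n;
}

/* Decode runs back into a bitmap. Returns the number of bytes written,
   or 0 if `bits` is too small. */
size_t bitmap_from_rle(const rle_run *in, size_t n, uint8_t *bits, size_t cap)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        for (uint8_t k = 0; k < in[i].count; k++) {
            if (len == cap)
                return 0;
            bits[len++] = in[i].value;
        }
    return len;
}
```

For a bitmap that is mostly one value, the run array is far shorter than the bitmap itself, which is the saving the transmissible type is meant to capture.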



Network Data Representation 

Communicating typed values in a heterogeneous environment re-
quires a data representation protocol. A data representation proto-
col defines a mapping between typed values and byte streams. A
byte stream is a sequence of bytes indexed by nonnegative integers. 
Examples of data representation protocols are Courier [13] and 
XDR [10]. A data representation protocol is needed because differ- 
ent machines represent data differently. For example, VAXes rep- 
resent integers with the least significant byte at the low address and 
68000s represent integers with the most significant byte at the low 
address. A data representation protocol defines the way data is rep-
resented so that machines with different local data representations
can communicate typed values to each other.

NCA includes a data representation protocol called Network Data 
Representation (NDR). NDR defines a set of data types and type 
constructors which can be used to specify ordered sets of typed 
values. NDR also defines a mapping between ordered sets of values 
and their representations in messages. 

Under NDR, the representation of a set of values consists of two 
items: a format label and a byte stream. The format label defines 
how scalar values are represented (e.g. VAX or IEEE floating 
point) in the byte stream; its representation is fixed by NDR as a 
data structure representable in four bytes. 

NDR supports the scalar types boolean, character, signed integer, 
unsigned integer, and floating point. Booleans are represented in 
the byte stream with one byte; false is represented by a zero byte 
and true by a non-zero byte. Characters are represented in the byte 
stream with one byte; either ASCII or EBCDIC codes can be used. 
Four sizes of signed and unsigned integers are defined: small, short, 
long, and hyper. Small types are represented in the byte stream 
with one byte, short types with two bytes, long types with four bytes,
and hyper types with eight bytes. Either big- or little-endian repre-
sentation can be used for integers; two's complement is assumed for
signed integers. The two sizes of floating-point type are single and 
double. Single floating-point types are represented with four bytes 
and double floating-point types use eight bytes. The supported 
floating-point representations are IEEE, VAX, Cray, and IBM. 

In addition to scalar types, NDR has a set of type constructors for 
defining aggregate types. These include fixed size arrays, open ar- 
rays, zero terminated strings, records, and variant records. 

Fixed size arrays have a known number of elements. Their values
are represented in the byte stream simply as a sequence of repre- 
sentations of the values of the elements. Each element value is rep- 
resented according to the element type of the array. Open array 
types have a fixed first index value and element type but their final 
index value is not known from their type. Therefore, it is necessary 
to represent the value of the index of the last element in the array 
immediately before the representation of the values of the array 
elements. 

Zero terminated strings can be viewed as a special case of open 
arrays; they are open arrays of characters whose last index value is 
defined by a terminating zero byte. To support this common data 
type in an efficient manner, NDR represents such values with an 
explicit length value followed by the characters of the string includ- 
ing the terminating zero character. 
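The string representation can be sketched in C as below. The width of the length field (two bytes here) and its byte order (the sender's own) are assumptions for the example, as are the names put_string and get_string. The sketch also assumes the stream index is already suitably aligned.

```c
#include <stdint.h>
#include <string.h>

/* Write a string as an explicit length followed by its characters,
   including the terminating zero. Returns bytes written, or 0 if the
   string does not fit. */
size_t put_string(uint8_t *stream, size_t cap, const char *s)
{
    size_t n = strlen(s) + 1;          /* count includes the terminator */
    uint16_t len;

    if (n > UINT16_MAX || sizeof len + n > cap)
        return 0;
    len = (uint16_t)n;
    memcpy(stream, &len, sizeof len);  /* sender's own byte order */
    memcpy(stream + sizeof len, s, n);
    return sizeof len + n;
}

/* Read a string written by put_string from a sender with the same byte
   order. Returns a pointer into the stream, or NULL if it is malformed. */
const char *get_string(const uint8_t *stream, size_t avail, size_t *consumed)
{
    uint16_t len;

    if (avail < sizeof len)
        return NULL;
    memcpy(&len, stream, sizeof len);
    if (len == 0 || sizeof len + len > avail ||
        stream[sizeof len + len - 1] != 0)
        return NULL;
    *consumed = sizeof len + len;
    return (const char *)(stream + sizeof len);
}
```

Because the length comes first, the receiver knows how many bytes to copy without scanning for the terminator, and because the terminator is also sent, the characters can be used in place as a C string.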

Record values are represented in the byte stream by representations 
of the values of their fields in the order defined by the record type. 
Variant records are assumed to have an initial set of fixed fields 
which includes a tag field used to discriminate among the possible 
variants. Representations of the values of the fields of the selected 
variant follow the representations of the values of the fixed fields of 
a variant record value. 

Some types may appear to be missing from NDR. NDR has no enu-
merated types, bit set types, or a pointer type constructor. The defi-
nition of NIDL maps such types onto their representations in an
NDR byte stream. For example, NIDL maps enumerated types and 
bit sets onto the NDR unsigned integer type of the appropriate size. 
Typed pointer values are mapped into the NDR type which repre-
sents the type that the pointer references.

NDR is abstract in that it does not define how the format label and
the byte stream are represented in packets. The NIDL compiler 
and the NCA/RPC protocol are users of NDR: They work together 
to generate the format label and byte stream, encode the format 
label in packet headers, fragment the byte stream into packet-sized 
pieces, and put the fragments in packet bodies. 

The important features of NDR are its flexible representation of 
scalar values, its use of natural alignment, and its extensibility. 

By using a format label to specify an interpretation of the scalars in
a byte stream, NDR supports a "recipient makes it right" approach
to data conversion in a heterogeneous environment. A sending proc-
ess can use its preferred encoding of scalars when constructing a
byte stream, provided that it is one of the defined options. A receiv-
ing process needs to convert data representations only when the
format specified in the incoming format label differs from its own 
preferred format. Thus, two compatible machines can communicate 
efficiently without needing to convert to a conventional network 
format and back again on each transmission. NDR defines a 
broadly useful but not universal set of scalar formats. We believe 
that our choices are reasonable for promoting heterogeneous net-
work computing combining workstations and special purpose server
machines. On the other hand, it is important to keep the space of 
possible formats to a reasonable size because each recipient needs 
to convert any incoming scalar format to its own. 
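The "recipient makes it right" rule for integers can be sketched in C as follows. The int_rep enumeration and its values are hypothetical stand-ins for the integer part of the format label, whose actual encoding is not given here. The sketch covers only four-byte integers and only byte order.

```c
#include <stdint.h>

/* Hypothetical integer-representation field of a format label. */
typedef enum { INT_BIG_ENDIAN = 0, INT_LITTLE_ENDIAN = 1 } int_rep;

/* Byte order of the machine running this code. */
int_rep local_int_rep(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1 ? INT_LITTLE_ENDIAN : INT_BIG_ENDIAN;
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Receiver side: `wire` holds four bytes copied unchanged from the byte
   stream. Convert only when the sender's label differs from ours. */
uint32_t receive_long(uint32_t wire, int_rep sender_rep)
{
    return sender_rep == local_int_rep() ? wire : swap32(wire);
}
```

When sender and receiver share a byte order, receive_long returns its argument untouched, which is the saving over a scheme that always converts to and from a single network format.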

NDR requires that values be naturally aligned in the byte stream.
Natural alignment means that all values of size 2^n are aligned at a
byte stream index which is a multiple of 2^n, up to some limiting
value of n; NDR chooses this limit to be 3. (Scalars of size up to eight
bytes are naturally aligned.) This permits, but does not require, 
implementations of NCA to align buffers for the byte stream so that 
stub code can use natural operators to manipulate values in the byte 
stream efficiently and without alignment faults. This also helps to 
promote communication ease between different kinds of machines 
in a heterogeneous environment.
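The alignment rule amounts to rounding the current byte stream index up to the next multiple of the value's size. A minimal C sketch, with the function name ndr_align chosen for this example:

```c
#include <stddef.h>

/* Round `index` up to the next multiple of `size`.
   `size` must be a power of two: 1, 2, 4, or 8. */
size_t ndr_align(size_t index, size_t size)
{
    return (index + size - 1) & ~(size - 1);
}
```

For example, a one-byte small at index 0 followed by a four-byte long places the long at index 4, leaving three bytes of padding. A two-byte short placed after that long lands at index 8, and an eight-byte hyper after the short lands at index 16.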

By its use of a format label NDR is an extensible data representa- 
tion protocol. The format label could be extended to specify other 
aspects of the data representation such as packing disciplines, dy- 
namic typing schemes, new encodings of scalars, or new classes of 
scalars.

The NCS NIDL Compiler and Stub Functions

NCS includes a compiler which mediates between NIDL on the one 
hand and NDR and the NCS runtime on the other. The functions 
of the compiler are: checking the syntax and "semantics" of inter- 
face definitions written in NIDL; translating NIDL definitions into 
declarations in implementation languages such as C; and generating 
client and server stubs for executing the remote operations of an 
interface. 

The NIDL compiler is organized as a front-end component and a 
back-end component. The front-end parses and checks an inter- 
face definition and produces an abstract syntax tree (AST) inter- 
mediate form. If the interface definition is sound, the front-end 
then passes this tree to the back-end which generates implementa- 
tion language include files and stub code files for the interface. 

NCS's NIDL compiler is implemented for portability in C using 
Yacc and Lex. It is available in source form to encourage its use 
and extension in heterogeneous networked environments. 



NIDL Compiler Functions 

Distributed object-oriented programming imposes certain restric- 
tions on the semantics of interfaces. It is part of the compiler's job 
(along with the design of NIDL) to enforce these restrictions. We 
illustrate the front-end's semantic checks with some examples. All 
types used in a definition must be well defined. All parameters and 
fields whose type is an open array require the use of a last_is attrib- 
ute to give their size at call time. Every remote interface requires a 
UUID. Every operation of an interface requires an implicit or ex- 
plicit handle parameter to support object-oriented programming. 

The second major function of the NIDL compiler is to derive files 
which declare the interface's constants, types, and operations in the 
languages in which client applications and servers are written. These 
files are included in client and server programs which use or imple- 
ment the remote operations of an interface. For the current imple- 
mentation the supported languages are C and Pascal. Generating 
these files is done by a fairly straightforward walk over the AST; 
adding the capability to generate include files in other ALGOL-like 
languages would be a simple exercise.

In addition to declaring the constants, types, and operations of an
interface, the derived include files declare two important statically 
initialized variables defined for each interface. One is the interface 
specification (ifspec) which encapsulates the identity of the inter- 
face and its salient properties (number of operations, well-known 
ports used, etc.). The ifspec variable is used in the binding and 
registering operations of the NCS runtime. The second variable is 
the server Entry Point Vector (EPV) which holds pointers to the 
server side's stub routines. This EPV variable is used by a server 
process when registering as a server for an interface; it is used by 
the NCS runtime to dispatch incoming calls. 
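The way an entry point vector supports dispatch can be sketched in C. The types sketch_ifspec and server_stub, the stub signature, and the balance operation are all hypothetical; the real ifspec carries more information (the interface UUID and well-known ports, for instance) and real stubs receive packets rather than integers.

```c
#include <stddef.h>

/* Hypothetical, much-reduced interface specification. */
typedef struct { unsigned version; unsigned op_count; } sketch_ifspec;

/* Hypothetical server stub signature. */
typedef int (*server_stub)(int arg, int *result);

/* Toy stubs standing in for generated code. */
static int deposit_stub(int arg, int *result) { *result = arg + 100; return 0; }
static int balance_stub(int arg, int *result) { (void)arg; *result = 100; return 0; }

/* The two statically initialized variables for the interface. */
const sketch_ifspec bank_ifspec = { 1, 2 };
const server_stub bank_epv[] = { deposit_stub, balance_stub };

/* What a runtime might do with an incoming call: check the operation
   number against the ifspec, then jump through the EPV. */
int dispatch(const sketch_ifspec *ifs, const server_stub *epv,
             unsigned opnum, int arg, int *result)
{
    if (opnum >= ifs->op_count)
        return -1;
    return epv[opnum](arg, result);
}
```

Operations are identified by their position in the vector, so the runtime needs no knowledge of operation names.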

The third major function of the NIDL compiler is to generate files 
of stub code for the operations defined in an interface. There are 
two such files — one contains client side stub routines and the other 
contains server side stub routines. This emitted code is in standard 
C, which we use as a universal assembler to promote portability. 
Each operation in an interface gives rise to a client stub routine and 
a server stub routine. The following section discusses the functions 
of these routines. 



Stub Functions 



Client stub routines are called by clients of an interface; they have 
the same interface as the operation for which they stand in. Server 
stub routines are called by the server side NCS runtime; their inter- 
face is defined by NCS. Client stub routines call the client side NCS 
runtime to perform remote calls. Server stubs call the manager's 
implementation of an operation to provide the actual service. Thus, 
the first function of stubs is to hide the NCS runtime from users and 
implementors of remote interfaces and to create the illusion of ac- 
cessing a remote procedure as though it were local. 

To communicate input and output arguments and function results 
between callers and called routines the stub must marshall and 
unmarshall argument values into call and reply packets. This is 
done in accordance with NDR and the conventions of NCS. Un- 
marshalling code is also responsible for detecting and performing 
necessary data conversions by comparing the incoming format label 
with the local formats. Data conversion is done by a combination of 
inline code and support operations in the NCS runtime. 
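
The idea that the sender marshalls in its own format and the receiver converts only when the format label differs can be sketched as follows. This is a deliberately reduced model: the one-byte label, the two byte-order codes, and the function names are assumptions for illustration, and real NDR labels describe more than integer byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed one-byte format label values for this sketch. */
enum { FMT_BIG_ENDIAN = 0, FMT_LITTLE_ENDIAN = 1 };

/* Report the integer byte order of the machine running this code. */
int local_format(void) {
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1 ? FMT_LITTLE_ENDIAN : FMT_BIG_ENDIAN;
}

/* Marshall: label the packet and copy the value in native form.
   The sender never converts. Returns the number of bytes written. */
size_t marshall_u32(uint8_t *pkt, uint32_t value) {
    pkt[0] = (uint8_t)local_format();
    memcpy(pkt + 1, &value, sizeof value);
    return 1 + sizeof value;
}

uint32_t swap_u32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Unmarshall: compare the incoming label with the local format and
   convert only when they differ. */
uint32_t unmarshall_u32(const uint8_t *pkt) {
    uint32_t value;
    memcpy(&value, pkt + 1, sizeof value);
    if (pkt[0] != (uint8_t)local_format())
        value = swap_u32(value);
    return value;
}
```

When both ends share a format, no conversion work is done at all, which is the common case on a network of like machines.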

The stubs also need to calculate the size requirements for call and 
reply packets based on the dynamic size of input and output
arguments. The size information is used to determine whether or not a
pre-declared packet on the stack is large enough. If not, the stubs 
need to allocate and free storage for packets. It is not the job of the 
stub to break up a large packet into pieces that can be sent over the 
network — the NCS runtime provides the capability of handling 
arbitrarily sized packets. 
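
The packet-sizing decision might look like the sketch below for an operation that takes a dynamically sized array of integers. The buffer size, header size, packet layout, and the fake_remote_sum stand-in for the runtime's call primitive are all assumptions made so that the example is self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STACK_PKT_SIZE 256  /* assumed size of the pre-declared packet */
#define HDR_SIZE 16         /* assumed fixed header size */

/* Size needed for a packet carrying a count and n_elems integers. */
size_t packet_size(uint32_t n_elems) {
    return HDR_SIZE + sizeof(uint32_t) + (size_t)n_elems * sizeof(int32_t);
}

/* Stand-in for the runtime's remote call primitive: it "executes" the
   call locally by summing the marshalled elements. */
int32_t fake_remote_sum(const uint8_t *pkt) {
    uint32_t n;
    int32_t sum = 0;
    memcpy(&n, pkt + HDR_SIZE, sizeof n);
    for (uint32_t i = 0; i < n; i++) {
        int32_t v;
        memcpy(&v, pkt + HDR_SIZE + sizeof n + (size_t)i * sizeof v, sizeof v);
        sum += v;
    }
    return sum;
}

/* Client stub sketch: use the stack packet when it is large enough,
   otherwise allocate one and free it afterwards. Returns 0 on success. */
int sum_stub(const int32_t *elems, uint32_t n, int32_t *result, int *used_heap) {
    uint8_t stack_pkt[STACK_PKT_SIZE];
    uint8_t *pkt = stack_pkt;
    size_t need = packet_size(n);

    *used_heap = need > sizeof stack_pkt;
    if (*used_heap) {
        pkt = malloc(need);
        if (pkt == NULL) return -1;
    }
    memset(pkt, 0, HDR_SIZE);
    memcpy(pkt + HDR_SIZE, &n, sizeof n);
    if (n > 0)
        memcpy(pkt + HDR_SIZE + sizeof n, elems, (size_t)n * sizeof *elems);
    *result = fake_remote_sum(pkt);
    if (*used_heap) free(pkt);
    return 0;
}
```

Note that the stub hands over one logical packet of whatever size is needed; nothing in it deals with fragmentation.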

Client side stubs map the operations of an interface to the operation 
number used by the NCS runtime to identify operations; they also 
pass options designating the desired calling semantics and the ifspec 
derived from the NIDL declaration of an operation to the NCS 
runtime's remote call primitive. 

On the server side, the stub routines are responsible for managing 
storage to be used as the server side surrogates for dynamically 
sized arguments. This is necessary to support the server's illusion of 
large data structures passed to it by reference. 

The stubs also manage the more elaborate features of NIDL de- 
scribed in section 3 above. Client stubs support automatic binding 
by calling users' binding and unbinding routines when necessary. 
Implicit handles are made explicit to the NCS runtime by client stub 
routines. Users' marshalling routines are invoked as necessary by 
both client and server stubs as part of marshalling input and output 
arguments of the appropriate types. 

In summary, the stub generation function of the NIDL compiler 
automates the production of a large amount of protocol code based 
on a routine's interface definition. This is important because the
code is complex enough to make its hand coding very error prone 
and tedious. Hand producing this kind of code has been a major 
impediment to building distributed systems in the past.



Location Broker



A highly available location service is a fundamental component of a 
distributed system architecture. Objects representing people, re- 
sources, or services are transient and mobile in a network environ- 
ment. Consumers of these entities cannot rely on a priori knowl- 
edge of their existence or location, but must consult a dynamic reg- 
istry. When consumers rely solely on a location service for accessing 
objects, it becomes essential that the location server remain avail- 
able in the face of partial network failures. 

The NCA Location Broker (NCA/LB) protocol is designed to pro- 
vide a reliable network-wide location broker. This protocol is de- 
fined by a NIDL interface and is thereby easily used by any NCA/ 
RPC based application. 

The NCA/LB, unlike location services like Xerox SDD's Clearing- 
house [8] or Berkeley's Internet Name Domain service (BIND) 
[11], yields location information based on UUIDs rather than on 
human readable string names. The advantages of using UUIDs were 
described earlier. 



Locating 



An object's type manager must first advertise its location with the
Location Broker in order for that object to be locatable. A manager
advertises itself by registering its location and its willingness to sup- 
port some combination of specific objects, types of objects, or inter- 
faces. A manager can choose to advertise itself as a global service 
available to the entire network, or limit its registration to the local 
system. Managers that choose the latter form of registration do not 
make themselves unavailable, but rather limit their visibility to cli- 
ents that specifically probe their system for location information. 

Clients find objects by querying the Location Broker for appropri- 
ate registrations. A client can choose to query for a specific object, 
type, interface, or any combination of these characteristics. When 
operations are externally constrained to occur at a specific location, 
a client can choose to query the location broker at the required 
system for managers supporting the appropriate object.



Location Broker Organization

The Location Broker is divided into two components. The Global 
Location Database is a replicated object containing the registration 
information of all globally registered managers; the processes that 
manage this database are called the Global Location Broker. The 
NCS runtime implementation of the Global Location Broker uses 
the Data Replication Manager (DRM) to maintain the database. 
DRM provides a weakly consistent replicated KSAM package. 
Weak consistency implies that replicas of the Global Location Data- 
base object may be inconsistent at any time, but, in the absence of 
updates, that all replicas will converge to a consistent state within a 
finite amount of time. This form of consistency provides a high de- 
gree of both read and update availability to the Global Location 
Database. It is not necessary to be able to communicate with all 
replicas of the object to effect a change in the registration database.
The DRM assumes the responsibility of propagating updates to the 
replicas in a timely fashion. 
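
Weak consistency of this kind can be illustrated with a small model. The sketch below uses a timestamped "newest update wins" rule, which is one common way to make replicas converge; the text does not say how the DRM actually reconciles replicas, so the rule, the types, and the function names here are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* One replica's copy of a single registration value. */
typedef struct {
    uint64_t stamp;  /* time of the update that produced this value */
    int value;
} replica_t;

/* Accept an update only if it is newer than what the replica holds. */
void replica_apply(replica_t *r, uint64_t stamp, int value) {
    if (stamp > r->stamp) {
        r->stamp = stamp;
        r->value = value;
    }
}

/* One propagation round: every replica offers its state to every other. */
void propagate(replica_t *reps, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            replica_apply(&reps[j], reps[i].stamp, reps[i].value);
}

/* Return 1 if all replicas hold the same stamp and value. */
int consistent(const replica_t *reps, size_t n) {
    for (size_t i = 1; i < n; i++)
        if (reps[i].stamp != reps[0].stamp || reps[i].value != reps[0].value)
            return 0;
    return 1;
}
```

Updates can be accepted at any single replica, the replicas may disagree for a while, and once updates stop a propagation round brings them to the same state.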

A Local Location Broker supports managers that wish to limit 
their registration to the local system. Access to these registrations is
provided in two ways. A client can directly query the Location
Broker at a specific node to determine the objects and managers that are
registered there. Alternatively, a client can simply execute a remote
operation while supplying an incompletely bound handle (i.e., one
which specifies only an object and system, not a particular server
process). Remote calls made using such a handle are delivered to 
the Local Location Broker, which serves as a forwarding agent if an 
appropriate manager has registered itself locally. This mechanism 
obviates the need for users of the NCA to use well-known ports. 
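
Registration and lookup by object, type, and interface can be modelled with a small in-memory table. In this sketch small integers stand in for UUIDs, LB_ANY acts as a wildcard, and a single table holds both global and local registrations for one node; these choices and all of the names are assumptions for illustration and do not reflect the NCA/LB interface itself.

```c
#include <stddef.h>
#include <stdint.h>

#define LB_ANY 0u   /* wildcard: "any object", "any type", "any interface" */
#define LB_MAX 16

typedef uint32_t lb_id_t;  /* small integers stand in for UUIDs */

typedef struct {
    lb_id_t object, type, iface;
    const char *location;  /* where the manager can be reached */
    int global;            /* 1: global registration, 0: local only */
} lb_entry_t;

lb_entry_t lb_table[LB_MAX];
size_t lb_count = 0;

/* A manager advertises itself. Returns -1 if the table is full. */
int lb_register(lb_id_t object, lb_id_t type, lb_id_t iface,
                const char *location, int global) {
    if (lb_count == LB_MAX) return -1;
    lb_table[lb_count++] = (lb_entry_t){ object, type, iface, location, global };
    return 0;
}

int lb_field_matches(lb_id_t wanted, lb_id_t registered) {
    return wanted == LB_ANY || registered == LB_ANY || wanted == registered;
}

/* Collect up to max matching registrations. A network-wide query passes
   global_only = 1; a query aimed at this node passes 0 and also sees
   local registrations. Returns the number of entries found. */
size_t lb_lookup(lb_id_t object, lb_id_t type, lb_id_t iface, int global_only,
                 const lb_entry_t **out, size_t max) {
    size_t found = 0;
    for (size_t i = 0; i < lb_count && found < max; i++) {
        const lb_entry_t *e = &lb_table[i];
        if (global_only && !e->global) continue;
        if (lb_field_matches(object, e->object) &&
            lb_field_matches(type, e->type) &&
            lb_field_matches(iface, e->iface))
            out[found++] = e;
    }
    return found;
}
```

A locally registered manager remains reachable in this model; it is simply invisible to queries that do not probe its node.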

The division of the Location Broker into two distinct entities is, to a 
large degree, an NCS runtime implementation decision. Logically 
the Local Location Database object and the Global Location Data- 
base object are a single partitioned object, and, in fact, access to 
these databases is provided through a common set of operations 
which select the target based on lookup keys.



The NCA/RPC Protocol and NCS Implementation

The NCA/RPC protocol is designed to be low cost for the common 
cases and independent of the underlying network protocols on top 
of which it is layered. The NCS runtime implementation of the 
NCA/RPC protocol is designed to be portable. 



Protocol 



The NCA/RPC protocol is designed so that a simple RPC call will 
result in as few network messages and have as little overhead as 
possible. It is well known that existing networking facilities designed 
to move long byte streams reliably (e.g., TCP/IP) are generally not 
well suited to being the underlying mechanism by which RPC
runtimes exchange messages. The primary reason for this is that the
cost of setting up a connection using such facilities and the associ- 
ated maintenance of that connection is quite high. Such a cost 
might be acceptable if, say, a client were to make 100 calls to one 
server. However, we don't want to preclude the possibility of one 
client making a call to 100 servers in turn. In general, we expect the 
number of calls made from a particular client to a particular server 
to be relatively small. The reliable connection solution is also unac- 
ceptable from the server's perspective: A popular server may need 
to handle calls from hundreds of clients over a relatively short pe- 
riod of time (say 1-2 minutes). The server does not want to bear 
the cost of maintaining network connections to all those clients. 

The well-known way of getting around the problem of
using reliable network connections is to make the RPC protocol
implement exactly the reliability it needs on top of an unreliable 
network service (e.g., UDP/IP). This approach has the additional 
advantage that some systems (e.g. embedded microprocessors) can 
not or do not support any reliable network service; however, if 
they're connected to a network at all, you can be sure that they'll at 
least supply an unreliable service. Further, unreliable services tend 
to be more similar across protocol suites than do reliable services. 
(For example, some reliable protocols might return errors
immediately if the network partitions even though a virtual circuit is
currently idle, while others might defer until the next time I/O is
attempted.) This similarity means that the RPC protocol can be
accurately implemented in more protocol suites than would be
possible if it assumed a reliable service.

All that the NCA/RPC protocol assumes is an underlying unreliable
network service. The protocol is robust in the face of lost, dupli- 
cated, and long-delayed messages, messages arriving out of order, 
and server crashes. When necessary, the protocol ensures that no 
call is ever executed more than once. (Calls may execute zero or 
one times and, in the face of network partitions or server crashes, 
the client may not know which.) 

The NCA/RPC protocol operates roughly as follows. The client side 
sends a packet describing the call (a request packet) and waits for 
a response. The server side receives and dispatches the request for 
execution, and sends a packet in response that describes the results 
of executing the call (the response packet). If the client doesn't 
receive a response to a request within a particular amount of time, 
it can inquire about the status of the request by sending a ping
packet. The server either sends back a working packet, indicating
that execution of the request is in progress, or a nocall packet, 
which means that the request has been lost (or that the server has 
crashed and rebooted) and the client needs to resend it. The proto- 
col gets slightly more complicated if the input or output arguments 
do not fit into one packet. 
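
The client side of this exchange can be viewed as a small state machine. The sketch below covers only the single-packet case described above; the state, event, and action names are invented for the example and are not taken from the NCS runtime.

```c
/* Client-side view of one remote call (single-packet case only). */
typedef enum { ST_AWAIT_RESPONSE, ST_AWAIT_PING_REPLY, ST_DONE } call_state_t;
typedef enum { EV_TIMEOUT, EV_RESPONSE, EV_WORKING, EV_NOCALL } call_event_t;
typedef enum { ACT_WAIT, ACT_SEND_PING, ACT_RESEND_REQUEST,
               ACT_DELIVER_RESULT } call_action_t;

/* Advance the call on one event and say what the client should do next. */
call_action_t client_step(call_state_t *state, call_event_t ev) {
    if (*state == ST_DONE) return ACT_WAIT;  /* call finished; ignore strays */
    switch (ev) {
    case EV_RESPONSE:                /* results arrived: the call is over */
        *state = ST_DONE;
        return ACT_DELIVER_RESULT;
    case EV_TIMEOUT:                 /* nothing heard: ask about the request */
        *state = ST_AWAIT_PING_REPLY;
        return ACT_SEND_PING;
    case EV_WORKING:                 /* server is executing: keep waiting */
        if (*state == ST_AWAIT_PING_REPLY) *state = ST_AWAIT_RESPONSE;
        return ACT_WAIT;
    case EV_NOCALL:                  /* request lost or server rebooted */
        if (*state != ST_AWAIT_PING_REPLY) return ACT_WAIT;
        *state = ST_AWAIT_RESPONSE;
        return ACT_RESEND_REQUEST;
    }
    return ACT_WAIT;
}
```

In the common case the machine sees exactly one event, the response, so a simple call costs one message pair.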

If a called procedure is non-idempotent, the protocol ensures that 
the server executes the call at most once. To detect old (duplicate) 
requests, the server keeps track of the sequence number of the 
previous request for each client with which it has communicated. 
(However, the server considers this information to be discardable
and it may discard it if it hasn't heard from the client in a while,
i.e., there is no permanent "connection" between the client and
server.) Thus, it is possible for a long-delayed duplicate request to
arrive after the server has discarded the information about the re- 
questing client. To handle this case, the server calls back to the 
client (using an idempotent remote procedure call) to ask the client 
for the client's current sequence number. The server then uses the 
returned sequence number to validate the request. Note then that 
for calls to non-idempotent procedures (with input and output
arguments that fit in a single packet), a total of two message pairs will be
exchanged between client and server for the simple case. Subse- 
quent calls between the same client and server will require just one 
message pair. Note that the extra message pair in the first case 
could conceivably be eliminated if the server were willing to hold 
onto client sequence number information for long enough to ensure 
that all duplicate requests had been flushed from the network. We 
chose not to take this approach since any time interval we consid- 
ered long enough (e.g., one minute or more) seemed too long to 
oblige the server to hold the information.

Also, for non-idempotent procedures, the server side saves and
periodically retransmits the response packet until the client side has
acknowledged receipt of the response. If the server side receives a 
retransmission of the request, it resends the saved response instead 
of re-executing the call. The client side acknowledges the response 
either implicitly, by sending a new request, or explicitly, by sending 
an acknowledgment packet. The protocol also handles the case in 
which the server has executed the non-idempotent call but, because 
of network partitions or a server crash, fails to send the response 
packet. 

If a called procedure is idempotent, the protocol makes no guaran- 
tees about how many times the procedure is executed. On idem- 
potent requests, the server side does not save the results of the 
operation once it has sent back the response packet. In addition, 
the client side is not required to acknowledge the receipt of re- 
sponses to idempotent requests. 
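
The server-side treatment of non-idempotent and idempotent requests described above can be sketched for a single client as follows. The record layout, the function names, and the use of a global variable to play the client's answer to the callback are assumptions; sequence number wraparound and periodic retransmission of the saved response are left out.

```c
#include <stdint.h>

/* What the server remembers about one client. */
typedef struct {
    int known;          /* 0 once the server has discarded its record */
    uint32_t last_seq;  /* sequence number of the previous request */
    int has_saved;      /* 1 while a response awaits acknowledgment */
    int saved_result;
    int exec_count;     /* how many times the procedure really ran */
    int callbacks;      /* how many times the client was called back */
} server_conn_t;

uint32_t client_current_seq;  /* what the client would answer a callback */

uint32_t ask_client_seq(server_conn_t *c) {
    c->callbacks++;
    return client_current_seq;
}

int run_procedure(server_conn_t *c, int arg) {
    c->exec_count++;
    return arg * 2;
}

/* Returns 1 and stores the result if a response is sent, 0 if dropped. */
int handle_request(server_conn_t *c, uint32_t seq, int idempotent,
                   int arg, int *out) {
    if (idempotent) {                 /* no guarantees, nothing saved */
        *out = run_procedure(c, arg);
        return 1;
    }
    if (c->known && seq == c->last_seq && c->has_saved) {
        *out = c->saved_result;       /* retransmission: resend, don't rerun */
        return 1;
    }
    if (c->known && seq <= c->last_seq) return 0;   /* old duplicate */
    if (!c->known) {                  /* no record: validate by callback */
        if (ask_client_seq(c) != seq) return 0;     /* long-delayed duplicate */
        c->known = 1;
    }
    *out = run_procedure(c, arg);
    c->last_seq = seq;                /* a new request also replaces the */
    c->has_saved = 1;                 /* previously saved response */
    c->saved_result = *out;
    return 1;
}

/* Explicit acknowledgment: the saved response may be dropped. */
void acknowledge(server_conn_t *c, uint32_t seq) {
    if (c->has_saved && c->last_seq == seq) c->has_saved = 0;
}

/* The server has not heard from the client in a while. */
void discard_record(server_conn_t *c) {
    c->known = 0;
    c->has_saved = 0;
}
```

Only the first non-idempotent call after the record has been discarded pays for the callback; later calls from the same client are validated from the remembered sequence number.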



Runtime 



The NCS RPC runtime is written in portable C and uses the BSD 
UNIX socket abstraction. (In terms of the socket abstraction, it 
uses SOCK_DGRAM-style sockets.) This abstraction is intended to
mask the details of various protocol families so that one can write 
protocol-independent networking code. (A protocol family is a suite 
of related protocols; e.g. TCP and UDP are part of the DoD IP 
protocol family; PEP and SPP are part of the Xerox NS protocol 
family.) In practice, however, the socket abstraction has to be ex- 
tended in several ways to make it possible to write truly protocol-in- 
dependent code. We extended the socket abstraction via a set of 
operations implemented in a user-mode subroutine library; the NCS 
runtime uses these extensions so that it can be truly protocol-inde- 
pendent. Bringing up the NCS runtime on a new protocol family 
should not require any changes to the NCS runtime proper. All that 
should be required is to add some relatively trivial routines to the 
socket abstraction extension library. 

NCS is careful about creating sockets. Sockets are a fairly scarce 
resource and tying lots of them up for a long period is not a good 
idea. NCS keeps a small private pool of sockets. One is pulled
from the pool when a process makes a remote call. When the call
completes, the socket is returned to the pool. The pool need
contain only one socket for the entire process if the system supports
only one thread of control per process (as is the case in standard
UNIX). 
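
A socket pool of this kind takes only a few lines. The following is a minimal sketch for a single-threaded process using the standard socket calls; the pool size, the choice of AF_INET, and the names pool_get and pool_put are assumptions, and the NCS runtime's own pool is certainly organized differently.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define POOL_MAX 4     /* assumed upper bound on idle sockets kept */

int pool[POOL_MAX];    /* idle datagram sockets */
int pool_len = 0;

/* Take a socket for the duration of one remote call. An idle socket is
   reused if there is one; otherwise a new datagram socket is created.
   Returns -1 if socket creation fails. */
int pool_get(void) {
    if (pool_len > 0) return pool[--pool_len];
    return socket(AF_INET, SOCK_DGRAM, 0);
}

/* Return a socket when the call completes. Sockets beyond the pool's
   capacity are closed rather than held. */
void pool_put(int fd) {
    if (pool_len < POOL_MAX)
        pool[pool_len++] = fd;
    else
        close(fd);
}
```

With one thread of control, at most one call is outstanding at a time, so the same descriptor is handed out again and again and the pool never grows beyond one entry.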

The use of the socket abstraction at all could be considered to be 
too much of a BSD-ism, thus reducing the portability of the run- 
time. Fortunately, two factors argue against this point of view: First, 
it appears that AT&T System V, Release 3 will support at least a
sufficient subset of the socket calls (layered on top of their own 
networking model). Second, even if the target of a port doesn't 
have anything resembling the socket interface, NCS's use of the
interface is fairly simple and it wouldn't be too hard to implement the
BSD calls in terms of whatever the target system supplies. 



Future Directions 

NCA and NCS represent the first step in a complete network com- 
puting environment. One of the guiding goals in the development of 
NCA has been transparency. This has a number of aspects: replica- 
tion, failure, concurrency, location, and name transparency. 

With replication transparency all copies of an object can be consid- 
ered equivalent. The user of an object cannot tell whether it con- 
sists of a single copy or many. The DRM provides replication trans- 
parency in the case where some short-lived inconsistencies can be 
tolerated. Future versions of NCA will include support for strongly 
consistent replication. 

Location transparency allows users to access objects without speci- 
fying where the objects are. Objects are free to be moved around 
the network to adapt to changing load conditions and the availabil- 
ity of new hardware. The Location Broker provides the ability to 
find the location of objects prior to their first use. We would like to 
be able to have objects move at any time during program execution. 

Concurrency transparency supports the illusion that a given client is 
the sole user of an object. NCS addresses this partially through con- 
current programming support which provides a simple locking facil- 
ity. In the future, we would like to address this, and to some de- 
gree, failure transparency, through the use of an object-oriented 
atomic transaction facility.

Failure transparency, i.e., the ability of components of a distributed
system to fail and recover transparently to their users, is largely a
function of location and replication transparency. When objects are
replicated and a given replica fails, another is available to take its
place. Location transparency hides the switch from one replica to 
another from the user. 

Neither NCA nor NCS addresses the issue of name transparency at
this point. We anticipate building a general purpose name server in 
a future version of NCS. In addition, we intend to address a higher- 
level form of naming: In many instances, it is more convenient to 
find an object by attributes rather than by a text name. An attribute 
broker will provide this ability. Thus, a client will be able to query 
the attribute broker for a list of "26 page/sec laser printers" rather 
than managing the mapping between machine names and attributes 
itself. 

Most of the focus in the NCA development so far has been on 
getting the basic model right. Once the object-oriented model is in 
place, we feel that these higher level services will evolve naturally. 
Had we started with a more traditional process-oriented model, the 
level of integration and transparency we desire would be much 
more difficult to achieve.



An Extensible I/O System

by 
Jim Rees, Paul H. Levine, Nathaniel Mishkin, Paul J. Leach 



Introduction 



For years, programming environments have provided device inde- 
pendent program I/O. The programmer normally codes file I/O re- 
quests using a standard set of procedure calls, such as the UNIX 
open, close, read, and write system calls, or language specific I/O 
calls. This model enables a program written primarily to perform 
I/O to simple files to also read from keyboards or IPC channels, 
and to write to display windows or IPC channels without any modifi- 
cation. The intent is to unburden the programmer from the neces- 
sity of either binding the program to a specific target for its I/O or 
enabling the program to adjust to the vagaries of different I/O tar- 
gets at program run-time; that is, to make the applications program 
I/O independent of target type. 

While this concept has been around for a long time, the systems 
that implemented the concept have generally had one major short- 
coming. The only way to add a new type of I/O target to the system 
was to modify the system source. In the case of UNIX operating 
systems, for example, it is necessary to modify and rebuild the op- 
erating system kernel and to have all of the software that imple- 
ments the management of the new I/O target permanently wired 
into physical memory. Most schemes for adding new file types to 
the UNIX kernel operate at the file system level, so that within a 
given file system, all files have the same type. Further, whenever a 
new type is added, various pieces of the system have to be modified 
to behave correctly with respect to the new type. Because of this 
sizable burden, programmers are discouraged from defining
numerous I/O target types.

Our goal was to create a framework in which file I/O could be truly
extensible — to allow users to define new types without modification 
to the basic system. Our work consisted of building a general frame- 
work for extensibility and then applying those techniques to stream 
I/O. We call the framework a typed object management system
and the associated file I/O facility Extensible Streams (ES). The
combination of these two is called the Domain Open System 
Toolkit. 

The system resulting from our work is novel because it: 

• Supports (relatively large) typed, permanent, sharable ob- 
jects in a distributed file system. 

• Allows users to define new types of objects. 

• Allows users to associate generic procedures (operations) 
with types; the procedures are dynamically loaded into the 
address space of processes when the procedure is invoked. 

The Open System Toolkit allows users to extend the Domain file 
system by inventing new file types and writing managers for these 
types. The current implementation allows dynamic creation of new 
types, and dynamic binding of typed objects to the managers which 
implement their behavior. Type managers are written and debugged 
as user programs and require no kernel modifications for installa- 
tion. This system has been used successfully to write and debug new 
device drivers, to add new types of files, and to provide remote file 
system interconnects to foreign file systems. 



Domain Architecture 

The Domain system [3] is an architecture for networks of personal 
workstations and server computers that creates an integrated dis- 
tributed computing environment. A major component of this dis- 
tributed system is a distributed file system [4] which consists of four 
major components: the object storage system, mapped file manage- 
ment, concurrency control and naming service. 

The Domain distributed object storage system (OSS) provides loca- 
tion transparent typed object management across a network of 
loosely coupled machines. We say "object" rather than file to
specifically include all of the named non-disk objects in a computing
environment, such as devices (serial I/O lines, magtapes, null,
etc.), IPC facilities (sockets, etc.) and processes. While a naming
service manages a network-wide hierarchical name space, at the 
OSS level objects are named by a 64-bit unique identifier (UID). 
The UID consists of a timestamp and a unique node ID. This guar- 
antees that the UID is unique across all Domain nodes for all time. 
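
Composing a 64-bit identifier from a timestamp and a node ID can be sketched as below. The text gives only the total width, so the split used here (20 bits of node ID, the remaining 44 bits of timestamp) and the function names are assumptions chosen for illustration, not the actual Domain UID layout.

```c
#include <stdint.h>

#define NODE_BITS 20   /* assumed width of the node ID field */
#define NODE_MASK ((UINT64_C(1) << NODE_BITS) - 1)

/* Pack a timestamp and a node ID into one 64-bit identifier. */
uint64_t uid_make(uint64_t timestamp, uint32_t node_id) {
    return (timestamp << NODE_BITS) | ((uint64_t)node_id & NODE_MASK);
}

uint64_t uid_timestamp(uint64_t uid) {
    return uid >> NODE_BITS;
}

uint32_t uid_node(uint64_t uid) {
    return (uint32_t)(uid & NODE_MASK);
}
```

Two nodes generating a UID at the same instant differ in the node field, and one node generating UIDs at different instants differs in the timestamp field, so no coordination between nodes is needed.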

A 64-bit object type UID is associated with every object. This type 
is used to divide the set of all objects into classes of like objects; all 
of the objects in a class have common properties and must be oper- 
ated upon by a single set of procedures. We use a UID (rather than 
any other kind of type identifier) because a system facility supports 
the unique creation of these 64-bit numbers across all Apollo prod- 
ucts. In the basic Domain system there are several types, including 
ASCII text, binary, directory, and record. This strong typing allows 
the creator of an object to explicitly specify its intended use and 
interpretation, rather than depending on the conventions and coop- 
eration of other users and programs. 

The Domain OSS supports a consistent set of facilities for naming,
locating, creating, deleting, and providing access control and ad- 
ministration over all objects. Each object has an inode, which we 
have extended to contain (among other things) the type UID of the 
described object. 

For disk-based objects OSS also provides storage containers (arrays 
of pages) for uninterpreted data. A process accesses this data by 
handing the kernel the object's UID and asking for it to be mapped 
into its address space. The process then uses ordinary machine in- 
structions to directly manipulate the contents of the object — the 
single-level store (SLS) concept of Multics [6], Pilot [7], and Sys- 
tem/38 [1]. 

Layered on top of the file system is the Streams library, a user state 
library mapped into every process's address space, which provides a 
traditional I/O environment for programs. The Streams library im- 
plements the standard I/O interfaces and so provides equal access 
to both disk and non-disk resident objects. The Domain Stream 
operations form a superset of the UNIX file I/O operations, as they 
include record-oriented operations and more inquiry operations,
but are all based on a file descriptor returned to callers of open.
Streams is an object-oriented facility in that its behavior is deter- 
mined by the type of object to which its operations are applied. 
When a stream operation is invoked, Streams calls the manager 
that handles operations for the type of the object being operated
on. The following figure diagrams the relationships among the
various pieces of the Domain object management system and Streams.

[Figure: The relationships among the various pieces of the Domain object
management system and Streams.]



Typed Object Management

The fundamental concept underlying the object-oriented part of 
our system is the notion that every object is strongly typed and that 
for each object type there is a set of executable routines that imple- 
ment a well-specified group of operations on that type. This sec- 
tion describes the object typing strategy, defines operations and de- 
scribes their partitioning into traits. It also explains the management 
facilities necessary to associate typed objects with the code that im- 
plements the operations defined for them. 

UNIX file system objects do not have an explicit type tag, but do 
keep a form of type information in several different places. The 
mode field in a file's inode contains some bits that distinguish 
among ordinary files, directories, character and block special files 
(devices), and, depending on the version of the UNIX system, FIFOs,
sockets, textual links, and other types of file system objects. There 
may also be type information coded into the major and minor de- 
vice numbers to, for example, distinguish between tape drives and 
disk drives. In some cases, type information is encoded in the first 
few bits of the file data itself. For instance, there may be "magic 
numbers" for tagging various flavors of executable (a.out) files.
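
For comparison with the single type tag described next, the following sketch shows how a UNIX program recovers the coarse type of a file from the mode field using the standard stat call and file type macros. The enumeration and function names are our own; only the POSIX calls and macros are real.

```c
#include <sys/types.h>
#include <sys/stat.h>

typedef enum {
    KIND_ERROR, KIND_ORDINARY, KIND_DIRECTORY,
    KIND_CHAR_SPECIAL, KIND_BLOCK_SPECIAL, KIND_FIFO, KIND_OTHER
} file_kind_t;

/* Decode the type bits kept in an inode's mode field. */
file_kind_t kind_from_mode(mode_t mode) {
    if (S_ISREG(mode))  return KIND_ORDINARY;
    if (S_ISDIR(mode))  return KIND_DIRECTORY;
    if (S_ISCHR(mode))  return KIND_CHAR_SPECIAL;
    if (S_ISBLK(mode))  return KIND_BLOCK_SPECIAL;
    if (S_ISFIFO(mode)) return KIND_FIFO;
    return KIND_OTHER;  /* sockets, links, and so on, depending on version */
}

/* Look up a path and classify it; KIND_ERROR if it cannot be examined. */
file_kind_t kind_of_path(const char *path) {
    struct stat sb;
    if (stat(path, &sb) != 0) return KIND_ERROR;
    return kind_from_mode(sb.st_mode);
}
```

The mode bits distinguish only a fixed handful of kinds; telling a tape drive from a disk, or one executable format from another, needs the device numbers or the file contents as well.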

In the Domain system, the type tag is a UID which is explicitly 
attached to the object at the time that the object is created. This 
provides the advantage of a single, common mechanism to distin- 
guish among all types. The use of a UID (rather than a small inte- 
ger) allows the arbitrary creation of new types without appealing to 
a central authority. 

Put more precisely, an object type is a set of legal states
together with a collection of operations that implement the state 
transitions. Operations can be viewed in two ways: as a specification 
of how to invoke a transformation on the state of an object, or as 
the executable code that performs the transformation. The collec- 
tion of code that implements the set of operations for an object type 
is known as that object type's type manager. 

A trait is an ordered set of operations. It represents a kind of be- 
havior that a client desires from an object. For example, the opera- 
tions open, close, read and write could be a "stream-like" trait, 
and the operations set speed and echo input could be part of a
"tty-like" trait. An object supports a trait if its type manager
implements the operations of the trait. For every trait that a type
manager supports, the manager provides an entry point vector (EPV),
which is an ordered list of pointers to the procedures that implement
the operations in the trait. 



[Figure: Object Type Manager, showing the stream trait and the tty trait
(set_speed, set_erase, set_kill, set_parity), their Entry Point Vectors,
and the Implementations of the Operations.]

The type manager consists of the routines that implement one or 
more sets of operations (traits) and the entry point vectors (EPVs) 
that map the supported operations to the routines that implement 
them. 

The implementation of the typed object management system has 
two main components: the type system and the trait system. We 
use the name Trait/Types/Managers (TTM) to refer to these two 
components plus the set of all type managers. 

The type system is responsible for maintaining a data base contain- 
ing mappings between type UID, type manager, and type name. 
New types can be created at will. For convenience, there is a name 
for every type, but a type UID rather than a type name is actually
attached to the file system objects. This guarantees that all types are
unique, even if two different implementors independently choose
the same name. The type system provides procedures that can be
used to create new types, associate a name with a type, and look up
the type UID of a given type name. It can also find the manager for a
given type. 

The role of the trait system is to bind <object, trait> pairs to type 
manager EPVs. It provides the trait_$bind call for this purpose. 
This call looks up the object's type UID and then asks the type 
system for the corresponding type manager. Object code libraries 
containing managers are not pre-linked with client object code. 
Rather, the trait system is responsible for dynamically loading them 
into the address space of clients as necessary. To perform this task, 
the trait manager uses the type system to locate the manager object 
code file. It then loads the manager into the address space of the 
client. The type manager is linked as an autonomous program 
whose main entry point is called when the manager is loaded. The 
code at this entry point registers all supported <trait, EPV> pairs 
with the trait system. Once the manager is loaded, the trait system 
returns the requested EPV to its client. 
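
The binding sequence just described (look for a registered EPV, load the manager if there is none, let the manager's entry point register its EPVs, then return the one requested) can be modelled in a few dozen lines. In this sketch "loading" is reduced to calling the manager's entry function, binding is keyed directly by type UID rather than by object, and every name is hypothetical; none of this is the actual trait_$ interface.

```c
#include <stddef.h>
#include <stdint.h>

enum { TRAIT_STACK = 1, TRAIT_IO = 2 };

typedef struct { uint64_t type_uid; int trait; const void *epv; } binding_t;
typedef struct { uint64_t type_uid; void (*entry)(void); int loaded; } manager_t;

#define MAX_BINDINGS 8
binding_t bindings[MAX_BINDINGS];
size_t n_bindings = 0;

/* Called by a manager's entry point for each <trait, EPV> pair it supports. */
void trait_register(uint64_t type_uid, int trait, const void *epv) {
    if (n_bindings < MAX_BINDINGS)
        bindings[n_bindings++] = (binding_t){ type_uid, trait, epv };
}

const void *find_binding(uint64_t type_uid, int trait) {
    for (size_t i = 0; i < n_bindings; i++)
        if (bindings[i].type_uid == type_uid && bindings[i].trait == trait)
            return bindings[i].epv;
    return NULL;
}

/* A sample manager for one type, supporting a one-operation stack trait. */
#define MY_TYPE_UID UINT64_C(0x1234)
typedef struct { int (*depth)(void); } stack_epv_t;
int my_depth(void) { return 3; }
const stack_epv_t my_stack_epv = { my_depth };
void my_manager_entry(void) {
    trait_register(MY_TYPE_UID, TRAIT_STACK, &my_stack_epv);
}

/* Stand-in for the type system's data base of installed managers. */
manager_t managers[] = { { MY_TYPE_UID, my_manager_entry, 0 } };

/* Bind a <type, trait> pair to an EPV, loading the manager on first use.
   Returns NULL if the type is unknown or does not support the trait. */
const void *trait_bind(uint64_t type_uid, int trait) {
    const void *epv = find_binding(type_uid, trait);
    if (epv != NULL) return epv;
    for (size_t i = 0; i < sizeof managers / sizeof managers[0]; i++) {
        if (managers[i].type_uid == type_uid && !managers[i].loaded) {
            managers[i].loaded = 1;
            managers[i].entry();   /* "load" the manager; it registers EPVs */
            return find_binding(type_uid, trait);
        }
    }
    return NULL;
}
```

A manager that is never bound to is never loaded, which is why the number of installed types need not be limited.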

The type definition for an EPV corresponding to a trait that de- 
scribes operations on stacks might look like: 



typedef struct {
    void (*push)(uid_$t, stack_$elem_t);
    stack_$elem_t (*pop)(uid_$t);
} stack_$epv;

The actual EPV for a type manager that supported the stack trait 
would be declared as: 



stack_$epv my_stack_epv = {
    my_push,
    my_pop,
};

where my_push and my_pop are the names of real procedures 
that implement the push and pop operations: 



void my_push(obj, elem)
uid_$t obj;
stack_$elem_t elem;
{
    ...
}

stack_$elem_t my_pop(obj)
uid_$t obj;
{
    ...
}



The client uses trait_$bind to get a pointer to an EPV from the 
trait system: 



trait_$epv *trait_$bind(obj, trait, typuidp, statusp)
uid_$t obj;          /* IN: object we want to operate on */
trait_$t trait;      /* IN: trait we want to use */
uid_$t *typuidp;     /* OUT: type of object */
status_$t *statusp;  /* OUT: status */

Once a client has called trait_$bind and received an EPV, it can 
invoke operations on the object. For example, to call the push and 
pop operations in the sample trait above: 



epv = (stack_$epv *) trait_$bind(my_obj, stack_$trait,
                                 &type_uid, &status);
(*(epv->push))(my_obj, an_elem);
an_elem = (*(epv->pop))(my_obj);
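
The fragments above can be assembled into one small, compilable sketch.
It is only an illustration: it uses ANSI prototypes, replaces the $ in
identifiers with plain underscores (standard C does not accept $), and
stands in for the trait system with a hypothetical toy_bind routine that
binds every object to a single toy manager. None of the names below are
Domain interfaces.

```c
#include <stddef.h>

typedef unsigned long obj_uid;      /* stand-in for uid_$t */
typedef int stack_elem;             /* stand-in for stack_$elem_t */

/* EPV for the stack trait: one function pointer per operation. */
typedef struct {
    void       (*push)(obj_uid, stack_elem);
    stack_elem (*pop)(obj_uid);
} stack_epv;

/* A toy manager: one fixed-size stack, whatever the object UID is. */
enum { TOY_DEPTH = 16 };
static stack_elem toy_data[TOY_DEPTH];
static size_t toy_top = 0;

void toy_push(obj_uid obj, stack_elem elem)
{
    (void)obj;
    if (toy_top < TOY_DEPTH)
        toy_data[toy_top++] = elem;   /* silently drops on overflow */
}

stack_elem toy_pop(obj_uid obj)
{
    (void)obj;
    return toy_top > 0 ? toy_data[--toy_top] : 0;   /* 0 when empty */
}

/* The manager's EPV, filled in with its real procedures. */
const stack_epv toy_stack_epv = { toy_push, toy_pop };

/* Hypothetical stand-in for trait_$bind. The real call looks up the
   object's type and loads its manager; this one always returns the
   toy manager's EPV. */
const stack_epv *toy_bind(obj_uid obj)
{
    (void)obj;
    return &toy_stack_epv;
}
```

A client would call toy_bind once and then invoke epv->push and epv->pop
through the returned pointer, never naming toy_push or toy_pop directly.
That indirection is what lets a different manager be substituted without
relinking the client.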

The Domain system provides a set of programs for creating and 
installing new types and their managers. A user who creates a new 
type will also typically write a type manager for that type. The man- 
ager is written as a set of subroutines, each implementing an opera- 
tion for the traits that the manager supports. The programmer can 
use the standard debugging tools on the type manager. The man- 
ager is installed by running a program that puts the executable code 
in a well-known place and registers the new manager with the type 
system data base. No kernel modifications are required, and the 
machine does not have to be rebooted. There is no limit on the 
number of object types a single system may support since their
managers are only loaded when needed.



Extensible Streams

Extensible Streams is a client of TTM. ES defines three basic traits:
IO, IO_OC, and IO_XOC. The IO trait contains the traditional I/O
operations: get (read), put (write), seek, etc. The IO_OC trait
contains the operations open and initialize. (The IO_XOC trait is
similar to IO_OC except that it supports extended naming, a facility
that allows non-standard pathnames, described below.) ES also
defines a set of auxiliary traits containing operations that only
some type managers will choose to implement. The current set of
auxiliary traits includes: SIO (operations for manipulating serial
I/O lines), SOCKET (operations corresponding to the 4.2BSD UNIX
"socket" system calls), PAD (operations for manipulating windows),
and DIRECTORY (operations for reading and manipulating
directories).

ES introduces a layer of abstraction on top of the basic operations. 
This layer — called the I/O Switch — supports the notion of an 
open stream and isolates the user of file system I/O from the TTM. 
An open stream is created by calling the I/O Switch procedure
ios_$open, which:

• Calls trait_$bind to get the IO and IO_OC EPVs for the
object being opened.

• Calls the manager's open operation. This operation re- 
turns a handle — a virtual address of a descriptor that is 
meaningful only to the manager. The manager stores in 
the handle whatever information it needs in order to main- 
tain the semantics of an open stream (e.g., position in 
stream, buffers). 

• Allocates an entry in the stream table, a table of open
streams. Each entry in this table contains the EPVs for the
IO and IO_OC traits, and the handle returned by the open
operation.

• Returns the small integer — the file descriptor — that 
identifies the table entry allocated in the previous step. 
This file descriptor is used by the application program on 
subsequent calls. 

Another I/O Switch procedure, ios_$create, is similar to ios_$open
except that it creates a new object and calls the manager's initialize
operation. In addition to returning a handle, the initialize
operation stores any information it needs in the newly created object.

For each operation in the IO trait, a trivial I/O Switch procedure
takes a file descriptor as its first argument, converts the descriptor
to a handle (by consulting the stream table), and calls the
appropriate procedure from the EPV (also obtained from the stream
table). The various forms of I/O (e.g., UNIX I/O system calls,
FORTRAN and Pascal language I/O primitives) are implemented in terms
of these I/O Switch procedures.
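
The stream table and one of the "trivial" switch procedures might be
sketched as follows. This is a simplified illustration, not the Domain
implementation: each entry holds only an IO EPV (the text says the real
table also keeps the IO_OC EPV), the EPV has just two operations, and
the names io_epv, switch_install, switch_get, and the in-memory toy
manager are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical EPV for the IO trait; only two operations are shown. */
typedef struct {
    long (*get)(void *handle, char *buf, long len);
    long (*put)(void *handle, const char *buf, long len);
} io_epv;

/* One stream-table entry: the manager's EPV plus its opaque handle.
   A NULL epv marks a free slot. */
typedef struct {
    const io_epv *epv;
    void         *handle;
} stream_entry;

enum { MAX_STREAMS = 32 };
static stream_entry stream_table[MAX_STREAMS];

/* Record an already-opened stream; returns its descriptor, or -1 if
   the table is full. */
int switch_install(const io_epv *epv, void *handle)
{
    int fd;
    for (fd = 0; fd < MAX_STREAMS; fd++) {
        if (stream_table[fd].epv == NULL) {
            stream_table[fd].epv = epv;
            stream_table[fd].handle = handle;
            return fd;
        }
    }
    return -1;
}

/* The trivial switch procedure: descriptor -> handle -> EPV call. */
long switch_get(int fd, char *buf, long len)
{
    if (fd < 0 || fd >= MAX_STREAMS || stream_table[fd].epv == NULL)
        return -1;
    return stream_table[fd].epv->get(stream_table[fd].handle, buf, len);
}

/* A toy manager whose handle is a cursor over an in-memory string. */
typedef struct { const char *data; long pos, len; } mem_handle;

long mem_get(void *h, char *buf, long len)
{
    mem_handle *m = h;
    long n = m->len - m->pos;   /* bytes left in the object */
    long i;
    if (n > len)
        n = len;
    for (i = 0; i < n; i++)
        buf[i] = m->data[m->pos + i];
    m->pos += n;                /* the handle remembers the position */
    return n;
}

long mem_put(void *h, const char *buf, long len)
{
    (void)h; (void)buf; (void)len;
    return -1;                  /* the toy object is read-only */
}

const io_epv mem_io_epv = { mem_get, mem_put };
```

The switch itself never interprets the handle. Only the manager that
returned it knows that it points at a mem_handle, which is why the
position in the stream can live there.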



Extended Naming 

Extended naming is a facility that allows the pathname of an ob- 
ject being opened to be augmented with additional text to be inter- 
preted by the Streams manager of the object to which the pathname 
refers. This additional text is called the residual pathname. 

If an application calls the I/O Switch's open procedure with a
pathname containing a residual, and the non-residual part of the
pathname names an object whose type manager implements the
IO_XOC trait (as opposed to the IO_OC trait), then the I/O Switch
passes the residual to the manager as one of the arguments in the
IO_XOC open operation. The manager is free to interpret the
residual in any way it chooses.

With program-level I/O based on a simple system naming facility, an
application program passes the name of a file system object to the
open call, the I/O Switch locates the specified object, and the
manager for that type of object then does its job. For example,
the pathname /usr/fonts/classic refers to the object whose
name is classic, which is catalogued in the directory whose name is
fonts, which in turn is catalogued in a directory object whose name
is usr. The I/O Switch resolves the entire pathname down to the
single target object, and passes a shorthand identifier for that object
to the manager.

The intent of extended naming is to allow the object managers
themselves to take over part of the pathname-walking responsibility
so that they can manage a collection of objects that can be
distinguished by the remainder of the pathname. To clarify this
notion, consider the following.

The pathname /jim/test.c would normally be interpreted as a file
named test.c catalogued in the directory named jim. The name
also suggests that the file is a C language source file and that all
operations that would need to work on such a file (e.g., compiling,
printing, editing) could be requested by specifying this name.

Now let's suppose that file test.c is of type history. The actual file 
system object contains the entire change history of the file, much
as an SCCS [9] file does. Programs that do not care
about the change history can open this file and read from it. The 
open and read requests are passed on by the I/O Switch to the 
history type manager, and the manager can be written so that the 
program always reads the latest version of the file. 

Extended naming takes the concept one step further by letting the
manager writer for the history object type accept additional
pathname text. Where the simply specified pathname results in the
reading of data from the latest version of the file test.c, the
manager writer might wish to allow a naming syntax of the form
/jim/test.c/-1 to indicate that the application wishes to use the
penultimate version of the file instead of the newest. The
I/O Switch allows this additional specification to be issued at the
application program layer and passed through to the manager for
the target object.

The application passes the pathname (with the extended name) to
the I/O Switch open routine. The open routine evaluates the
pathname one component at a time, walking from left to right. In the
current example, jim is a directory in which the name test.c is
catalogued, and test.c is discovered to be a history file (not a
directory). Because the original pathname still has remaining text
(-1) that the I/O Switch cannot resolve, the Switch passes that
remainder to the history object manager's IO_XOC open routine. The
history manager is then able to decide what text to provide to
subsequent read requests, and the intended result occurs. The
application program is not affected by the apparent peculiarity of
the original pathname. The I/O Switch avoids confusion by walking
the pathname only through objects that support the directory trait,
and the manager gets whatever information it needs to do the job it
was written to do.

Other examples of extended names a history manager might be willing
to accept are:

/jim/test.c/03.02.85
/jim/test.c/original
/jim/test.c/yesterday
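
The left-to-right walk that stops at the first non-directory object can
be sketched as follows. The tiny in-memory name space, the toy_obj
record, and the toy_walk routine are hypothetical; a real I/O Switch
would consult directory objects through the DIRECTORY trait. This sketch
also drops the separator in front of the residual, which a real switch
need not do.

```c
#include <string.h>
#include <stddef.h>

typedef struct {
    int         parent;   /* index of containing directory, -1 for root */
    const char *name;
    int         is_dir;   /* nonzero if the object supports the
                             directory trait */
} toy_obj;

/* Hypothetical name space: "/", "/jim", and "/jim/test.c", where
   test.c is a history object rather than a directory. */
static const toy_obj toy_ns[] = {
    { -1, "",       1 },
    {  0, "jim",    1 },
    {  1, "test.c", 0 },
};
enum { TOY_NS_LEN = sizeof toy_ns / sizeof toy_ns[0] };

/* Find the entry called name[0..len) in directory dir, or -1. */
static int toy_lookup(int dir, const char *name, size_t len)
{
    int i;
    for (i = 0; i < TOY_NS_LEN; i++)
        if (toy_ns[i].parent == dir &&
            strlen(toy_ns[i].name) == len &&
            memcmp(toy_ns[i].name, name, len) == 0)
            return i;
    return -1;
}

/* Walk path from left to right. Returns the index of the target
   object, or -1 if a component is missing. On success *residual
   points at the unconsumed text (an empty string when the whole
   pathname was resolved). */
int toy_walk(const char *path, const char **residual)
{
    int cur = 0;                       /* start at the root directory */
    const char *p = path;

    for (;;) {
        size_t len;
        while (*p == '/')
            p++;                       /* skip separators */
        if (*p == '\0' || !toy_ns[cur].is_dir)
            break;                     /* done, or cannot walk further */
        len = strcspn(p, "/");
        cur = toy_lookup(cur, p, len);
        if (cur < 0)
            return -1;
        p += len;
    }
    *residual = p;
    return cur;
}
```

For /jim/test.c/-1 the walk stops at test.c with the residual -1, which
is the text the switch would hand to the manager's IO_XOC open
operation. For /jim/test.c the residual is empty.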

Another example of the application of extended naming is a gateway
to a non-Domain file system. For example, imagine an object
whose name is THEM and whose type is UNIX_gate. A pathname
of the form /gateways/THEM/usr/jan/test.c could be
passed by an application program to the I/O Switch. The Switch
would see that the object named gateways was a directory and
would look the name THEM up in that directory. THEM would be
found to be a UNIX_gate object, and since the Switch cannot walk
the pathname through objects that are not directories, it would call 
the UNIX_gate object manager's open routine. That routine is 
passed the UID for the object whose name is THEM and the re- 
maining pathname (/usr/jan/test.c). The UNIX_gate manager 
then has the information it needs to contact a remote file service for 
the data it needs to meet the demands of the requesting application 
program. The protocol that the manager uses to access the remote 
files is entirely up to the manager writer, and because the manager 
runs in user space, it is not restricted to kernel services but can use 
any service available at the user level. This scheme has been used 
to build a type manager that interconnects the Domain file system 
with a generic Berkeley 4.2 UNIX file system, and another that
connects to a VAX/VMS file system.



Underlying Facilities 

Many facilities provided in the Domain environment made the im- 
plementation of TTM and Extensible Streams possible. These fa- 
cilities make it possible to write OS-like functions in user space. 

The underlying virtual memory system, which allows objects to be
mapped into the virtual address space, is needed to give type
managers low-level, yet controlled, access to the raw data in
objects. The virtual memory system allows more flexible access to the
address space than that allowed by sbrk(2). The mapping calls take
the name of an object, map the object into the address space, and
return a pointer to (i.e., the virtual address of) the mapped object.
The address space of a process can be characterized solely in terms
of what objects are mapped where. Processes are not allowed to
make memory references to parts of the address space to which no
object is mapped.

The read/write storage (RWS) facility is a flexible and efficient stor- 
age allocation mechanism. It is implemented in user space in terms 
of the virtual memory primitives; it maps temporary objects into the 
address space and allocates storage from that part of the address 
space. It allows storage to be allocated from multiple pools. One 
pool corresponds exactly to the type of storage allocated by malloc. 
Another pool is similar, except its state is not obliterated by exec 
calls. Type managers must use storage from this pool to hold per- 
process state information since open streams must survive calls to 
exec. 

RWS also provides a global storage pool. The global pool is a place 
where storage that can be viewed from all processes' address spaces 
can be allocated. The allocation call returns a pointer to the allo- 
cated storage, and this pointer is valid in all processes. Type manag- 
ers must use storage from the global pool to maintain things like the 
current position (i.e., offset from beginning-of-file) of an open 
stream. If a process opens a stream to an object, forks, and then 
the child does I/O to the stream it was passed, the parent sees the 
position of its stream change too. Thus, position information must 
be in storage accessible to both parent and child. Because type
managers run in user space, they need a user space global storage
allocator for this purpose.

The dynamic program loader allows the system to load managers as 
they are needed. Managers for types that are not used by a given 
process do not take up any virtual address space in that process. 
The loader is implemented in user space in terms of the RWS facil- 
ity (to allocate space for static data) and the mapping calls. The 
pure parts of executable images are simply mapped into the address 
space before execution, because the compilers produce position-in- 
dependent code. In 4.2BSD, only the kernel can be dynamically 
linked to; all other subroutines must be statically bound to the pro- 
gram which uses them. 

The eventcount [8] (EC2) facility is the basic process synchroniza- 
tion mechanism. Eventcounts are similar to semaphores: 
eventcounts are associated with significant events, and processes 
can advance an eventcount to notify another process that an event 
has occurred, or wait on a list of eventcounts until the first event
happens.

A design principle for all Domain interfaces is that for every
potentially blocking procedure in an interface, there is an associated
eventcount that can be obtained through the interface and that is
advanced when the blocking procedure would have unblocked.
This allows programs to wait for multiple events (say, input
on a TTY line and arrival of a network message) simultaneously.
The 4.2BSD select(2) system call is implemented in terms of
eventcounts. However, unlike select, eventcounts can also be used
to wait on non-I/O events, such as process death.
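
A minimal sketch of the idea, built on POSIX threads rather than on the
Domain EC2 calls (whose signatures are not given here): every eventcount
is a counter guarded by one shared lock and condition variable, and a
waiter names a list of counters together with the value each must reach.
The names eventcount, ec_read, ec_advance, and ec_wait_any are
hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* One lock and one condition variable shared by all eventcounts, so
   that a waiter can sleep on several counters at once. */
static pthread_mutex_t ec_mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ec_cv = PTHREAD_COND_INITIALIZER;

typedef struct { unsigned long value; } eventcount;

/* Read the current value of an eventcount. */
unsigned long ec_read(eventcount *ec)
{
    unsigned long v;
    pthread_mutex_lock(&ec_mu);
    v = ec->value;
    pthread_mutex_unlock(&ec_mu);
    return v;
}

/* Announce that the event has occurred once more. */
void ec_advance(eventcount *ec)
{
    pthread_mutex_lock(&ec_mu);
    ec->value++;
    pthread_cond_broadcast(&ec_cv);    /* wake every waiter to recheck */
    pthread_mutex_unlock(&ec_mu);
}

/* Block until some ecs[i] has reached targets[i]; return that index.
   The caller must pass n >= 1, otherwise this never returns. */
size_t ec_wait_any(eventcount *const ecs[],
                   const unsigned long targets[], size_t n)
{
    size_t i;
    pthread_mutex_lock(&ec_mu);
    for (;;) {
        for (i = 0; i < n; i++) {
            if (ecs[i]->value >= targets[i]) {
                pthread_mutex_unlock(&ec_mu);
                return i;
            }
        }
        pthread_cond_wait(&ec_cv, &ec_mu);
    }
}
```

A program waiting for either TTY input or a network message would read
both counters, add one to each to form the targets, and call
ec_wait_any. The returned index says which event happened first.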

The mutual exclusion (MUTEX) facility is a user-state library that 
contains calls that allow multiple processes to synchronize their ac- 
cess to shared data (i.e., data in objects that are mapped into multi- 
ple processes). MUTEX is implemented in terms of EC2. MUTEX 
defines a lock record that consists of a lock byte and an 
eventcount. Typically, applications embed a record of this type in a 
data structure over which mutual exclusion must be maintained. A 
MUTEX lock is set by calling mutex_$lock, which attempts to set 
the lock byte (using the hardware test-and-set instruction). If it fails 
to set the lock byte, it waits on the eventcount; when the wait
returns, mutex_$lock repeats the attempt to set the lock byte.
mutex_$unlock unlocks a MUTEX lock by clearing the lock byte 
and advancing the eventcount. Type managers use shared storage 
to maintain various kinds of information. To control access to this 
data, managers use the MUTEX facility. 
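
The lock and unlock procedures described above can be sketched with C11
atomics. This is an illustration under stated assumptions, not the
Domain library: the record and function names are hypothetical, the
eventcount is reduced to an atomic counter, and the wait is a polling
loop with sched_yield where a real eventcount would block the process.
toy_mutex_trylock is not part of the description above; it is added only
so the lock state can be observed.

```c
#include <stdatomic.h>
#include <sched.h>

/* Lock record: a lock byte plus an eventcount, as in the text. */
typedef struct {
    atomic_flag  lock_byte;   /* set while the lock is held */
    atomic_ulong ec;          /* advanced on every unlock */
} toy_mutex;

void toy_mutex_init(toy_mutex *m)
{
    atomic_flag_clear(&m->lock_byte);
    atomic_init(&m->ec, 0UL);
}

/* Stand-in for an eventcount wait: poll until the count reaches
   target. A real eventcount would put the caller to sleep. */
static void toy_ec_wait(atomic_ulong *ec, unsigned long target)
{
    while ((long)(atomic_load(ec) - target) < 0)
        sched_yield();
}

void toy_mutex_lock(toy_mutex *m)
{
    for (;;) {
        /* Sample the eventcount before trying the lock byte, so an
           unlock that slips in between the failed attempt and the
           wait is not missed. */
        unsigned long seen = atomic_load(&m->ec);
        if (!atomic_flag_test_and_set(&m->lock_byte))
            return;                      /* byte was clear: we hold it */
        toy_ec_wait(&m->ec, seen + 1);   /* wait for the next unlock */
    }
}

/* Non-blocking attempt; returns 1 if the lock was acquired. */
int toy_mutex_trylock(toy_mutex *m)
{
    return !atomic_flag_test_and_set(&m->lock_byte);
}

void toy_mutex_unlock(toy_mutex *m)
{
    atomic_flag_clear(&m->lock_byte);    /* clear the lock byte ... */
    atomic_fetch_add(&m->ec, 1UL);       /* ... then advance the count */
}
```

Because the lock record is plain data, it can be embedded in a structure
that lives in an object mapped into several processes, which is how the
text says applications typically use it.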

The shared file control block (SFCB) facility allows multiple
processes to coordinate their access to the same object. Processes
may want to keep various kinds of dynamic information about an
object. For example, type managers need to maintain information
about the object's current length, whether the object is being ac- 
cessed for read or write, and whether other processes should be 
allowed to concurrently access the object. Since this information 
must be accessed by multiple processes, it must reside in global 
storage. 

The first process to access the object can allocate the storage, but 
how are other processes to find the virtual address of that storage? 
The SFCB facility addresses this problem by maintaining a table 
translating object UID into global virtual address. (The table is in 
global storage at a well-known location.) The sfcb_$get call takes 
an object UID and returns a pointer to a piece of global storage 
(called the SFCB). If no storage was "registered" with SFCB prior
to the call, an SFCB is allocated and registered under the specified
UID; otherwise, a pointer to the existing storage associated with
that UID is returned and a use count field in the storage is
incremented to reflect the additional "user" of the storage. sfcb_$free
decrements the use count and, if it reaches zero, frees the storage.
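
The get and free behavior can be sketched with an ordinary linked list.
This is a single-process illustration only: the real table lives in
global storage shared by all processes and would need MUTEX protection,
and the names sfcb, toy_sfcb_get, and toy_sfcb_free, as well as the
length field, are hypothetical.

```c
#include <stdlib.h>

typedef unsigned long obj_uid;     /* stand-in for a Domain UID */

typedef struct sfcb {
    obj_uid      uid;
    int          use_count;        /* number of current users */
    long         length;           /* example of shared per-object state */
    struct sfcb *next;
} sfcb;

/* UID -> SFCB table. In Domain this would sit in global storage at a
   well-known location; here it is an ordinary list. */
static sfcb *sfcb_table = NULL;

/* Return the SFCB registered under uid, creating it on first use. */
sfcb *toy_sfcb_get(obj_uid uid)
{
    sfcb *s;
    for (s = sfcb_table; s != NULL; s = s->next) {
        if (s->uid == uid) {
            s->use_count++;        /* one more user of existing storage */
            return s;
        }
    }
    s = calloc(1, sizeof *s);      /* zero-filled fresh SFCB */
    if (s == NULL)
        return NULL;
    s->uid = uid;
    s->use_count = 1;
    s->next = sfcb_table;
    sfcb_table = s;
    return s;
}

/* Drop one use; unregister and free the SFCB when none remain. */
void toy_sfcb_free(sfcb *s)
{
    sfcb **pp;
    if (--s->use_count > 0)
        return;
    for (pp = &sfcb_table; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == s) {
            *pp = s->next;
            break;
        }
    }
    free(s);
}
```

Two callers that pass the same UID receive the same pointer, so state
written by one (for example the current length) is visible to the other
until the last user releases it.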



Examples 



Extensible Streams allows a number of special-purpose types to be 
defined. For example: 

• History objects: objects that contain many logical versions, 
only one of which is presented through the open stream at 
a time. The residual text is used to specify a particular ver- 
sion; if omitted, the most recent version is presented. Use- 
ful for source control systems. 

• Circular objects: objects that grow to a certain size and 
then have their "oldest" data discarded when more data is 
written to them. Useful for maintaining bounded log out- 
put from long-running programs. 

• Structured documents: objects that contain document con- 
trol (e.g., font and sectioning) information but which can 
be read through an open stream as if they were simple AS- 
CII text. Useful for using conventional text processing 
tools (e.g., UNIX grep) [10]. 

• Gateways to non-Domain file systems: objects that are 
placeholders for entire remote file systems. The residual is 
used to specify a particular file on the remote system. The 
manager implements whatever network protocol it chooses 
to access the remote system's data. 

• Distributed, replicated data bases: objects that, for reliabil- 
ity reasons, are distributed across a network of machines. 
A Yellow Pages [5] manager would eliminate the need for 
the ypcat command, and allow any ordinary user to access 
a Yellow Pages data base without modification and without 
having to bind to a special library (the type manager, in ef- 
fect, is the library). 

TTM can be used independently of Extensible Streams. For example,
the Domain graphics library may be converted to use TTM.
Currently, the graphics library has code for all the display hardware
types it must support. A TTM-based implementation would define
multiple types, one for each type of display hardware, a trait that
contains graphics operations (e.g., move, draw, trapezoid_fill),
and a set of managers, one per type. This approach would make it
possible for only the code necessary for a particular display
hardware type to be loaded into the system, and for the graphics
library to be easily extensible to new hardware types.



Experience 



While the original Streams library was written with the idea of types 
and type managers in mind, the actual implementation had to be 
restructured substantially to take advantage of TTM. We took this 
opportunity to redesign the interface to managers and the interface 
presented to applications that use the Streams library. 

The decision to implement the Berkeley socket calls in terms of a
trait turned out to be a good one. On a standard Berkeley UNIX
system, defining and implementing a new domain (address family)
is a fairly difficult task: it requires working inside the kernel. With
Extensible Streams, you need only create a new type and implement
the SOCKET trait in the manager for that type. We have already
implemented a manager for "Domain domain sockets." Currently,
this domain supports only datagram-oriented sockets
(SOCK_DGRAM) because our short-term goal was merely to allow
access to specific, low-level Domain networking primitives using the
generic, high-level socket calls.

The nature of the address family space made our task a bit more 
complicated. Address families are identified by small integers in a 
space over which there is no central authority. As a result, one has 
to simply pick an address family out of thin air and hope no one 
else has picked it too. It is interesting to contrast this state of affairs 
with the type UID approach we took in TTM, since the small inte- 
ger address families are essentially type tags. The type UID ap- 
proach does not have the problem of more than one person picking 
the same type tag. We did not have the option to change the way 
address families are identified, so we used a scheme in which ad- 
dress families are translated into type UIDs. 

The socket creation primitive is called socket_$create_type. This
call takes a type UID (and a socket type) and returns a stream to a
socket of that type. (socket_$create_type is analogous to
ios_$create except that it calls the create operation in the SOCKET
trait instead of the initialize operation in the IO_OC trait.) The
socket system call converts its address family argument into a type
UID by consulting an object in the file system that contains a table
translating address families into type UIDs. It then calls
socket_$create_type. Note that we could have simply hardcoded a
"switch" statement on address family into the implementation of
socket, but this would have meant that socket would not have been as
extensible as we would like. (User-defined sockets could have been
created via socket_$create_type, but not by socket.) The scheme we
implemented is less than ideal in that it requires both that the
type be created and that the address-family-to-type-UID object be
updated, but it was the best we could do.
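
The translation step can be sketched as a small table that is consulted
at run time rather than compiled into socket. The in-memory array below
stands in for the file system object mentioned above, and the names
af_register and af_to_type, the type_uid stand-in, and the numeric
values used with them are hypothetical.

```c
#include <stddef.h>

typedef unsigned long type_uid;    /* stand-in for a type UID;
                                      0 means "no such type" */

typedef struct {
    int      family;               /* small-integer address family */
    type_uid type;                 /* type whose manager implements
                                      the SOCKET trait */
} af_entry;

enum { AF_TABLE_MAX = 16 };
static af_entry af_table[AF_TABLE_MAX];
static size_t   af_count = 0;

/* Add a family-to-type mapping. Returns 0 on success, or -1 if the
   family number is already taken or the table is full. */
int af_register(int family, type_uid type)
{
    size_t i;
    for (i = 0; i < af_count; i++)
        if (af_table[i].family == family)
            return -1;             /* someone already picked this number */
    if (af_count == AF_TABLE_MAX)
        return -1;
    af_table[af_count].family = family;
    af_table[af_count].type = type;
    af_count++;
    return 0;
}

/* Translate an address family into a type UID, or 0 if unknown. */
type_uid af_to_type(int family)
{
    size_t i;
    for (i = 0; i < af_count; i++)
        if (af_table[i].family == family)
            return af_table[i].type;
    return 0;
}
```

The failed second registration of the same family number is the
collision problem described earlier: small integers have no central
authority, whereas type UIDs cannot clash.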

One difficult problem that we have not adequately addressed is that 
of expanding wildcards in an extended name. For example, using 
our VMS gateway type manager, one would like to type the name: 

/gateways/my_vms_sys/dra0:[rees.*]mail.txt

If my_vms_sys is a gateway object to a VMS system, and
dra0:[rees.*]mail.txt is a VMS file specification, this specification
should be expanded to include files named mail.txt in all
subdirectories of dra0:[rees]. Unfortunately, the agent doing the wildcard
expansion (typically the UNIX shell) has no knowledge of the syn- 
tax of the extended part of the name, and so has no way to expand 
the wildcard. We considered implementing a "wildcard trait," but 
this is difficult to specify in a general way, and every program that 
does wildcard expansion would have to be modified to use this trait. 
Instead, we require that standard UNIX hierarchical names with / 
separators be used whenever wildcards are being expanded, but we 
also allow non-standard syntax (as in the example above) if there 
are no wildcards. 

The semantics of certain UNIX operations turned out to be fairly 
obscure. For example, suppose a program sets the FAPPEND flag
(via fcntl(2)) to "true," then forks, then the child sets the flag to
"false." Is the change to the stream state seen by the parent as 
well? We were frequently obliged to look at UNIX kernel source or 
to write sample programs and run them on a standard UNIX system 
to answer our questions. As we discuss below, we are led to believe 
that the task of producing exact semantic specification is a forbid- 
ding one. The various UNIX standards committees have their work 
cut out for them if they intend to do a complete job.

Another interesting experience gained during the implementation of
TTM and Extensible Streams relates to the problem of documentation.
The goal of Extensible Streams is to make it possible for people
who are not employees of Apollo Computer to write new type
managers without having access to Apollo source code. This means
that the specification of the semantics of the operations must be
very precise: it must completely characterize the expectations of
application programs that do I/O. The creation of this specification
turned out to be a non-trivial task.



Reader's Response






Document Title: Domain/OS Design Principles

Order No.: 014962-A00 

Date of Publication: January, 1989 


